Abstract

This article treats the problem of learning a dictionary providing sparse representations for a given signal class,
via $\ell_1$-minimisation. The problem can also be seen as factorising a $d \times N$ matrix $Y = (y_1 \ldots y_N)$,
$y_n \in \mathbb{R}^d$, of training signals into a $d \times K$ dictionary matrix $\Phi$ and a $K \times N$ coefficient
matrix $X = (x_1 \ldots x_N)$, $x_n \in \mathbb{R}^K$, which is sparse. The exact question studied here is when a
dictionary coefficient pair $(\Phi, X)$ can be recovered as a local minimum of a (nonconvex) $\ell_1$-criterion with
input $Y = \Phi X$. First, for general dictionaries and coefficient matrices, algebraic conditions ensuring local
identifiability are derived, which are then specialised to the case when the dictionary is a basis. Finally, assuming
a random Bernoulli-Gaussian sparse model on the coefficient matrix, it is shown that sufficiently incoherent bases are
locally identifiable with high probability. The perhaps surprising result is that the typically sufficient number of
training samples $N$ grows, up to a logarithmic factor, only linearly with the signal dimension, i.e.
$N \approx C K \log K$, in contrast to previous approaches requiring combinatorially many samples.

I. Introduction

Many signal processing tasks, such as denoising and compression, can be efficiently performed if one knows a
sparse representation of the signals of interest. Moreover, a huge body of recent results on sparse representations
has highlighted their impact on inverse linear problems such as (blind) source separation and localisation as well
as compressed sampling; for a starting point see e.g. [25], [12], [9], [7].

In any of these publications, one will - more likely than not - find a statement starting with 'given a dictionary $\Phi$
and a signal $y$ having an $S$-sparse approximation/representation $y = \Phi x$ ...', which points exactly to the remaining
problem: all applications of sparse representations rely on a signal dictionary $\Phi$ from which sparse linear expansions
can be built that efficiently approximate the signals from a class of interest; success heavily depends on the good
fit between the data class and the dictionary.

For many signal classes, good dictionaries - such as time-frequency or time-scale dictionaries - are known, but
new data classes may require the construction of new dictionaries to fit new types of data features. The analytic
construction of dictionaries such as wavelets and curvelets stems from deep mathematical tools from Harmonic
Analysis. It may, however, be difficult and time consuming to develop complex mathematical theory each time
a new class of data, which requires a different type of dictionary, is met. An alternative approach is dictionary
learning, which aims at inferring the dictionary from a set of training data $y_n$. Dictionary learning, also known
as sparse coding, has the potential of 'industrialising' sparse representation techniques for new data classes.
This article treats the theoretical dictionary learning problem, expressed as a factorisation problem which consists
of identifying a $d \times K$ matrix $\Phi$ from a set of $N$ observed training vectors $y_n \in \mathbb{R}^d$, knowing that
$y_n = \Phi x_n$, $1 \le n \le N$, for some unknown collection of coefficient vectors $x_n \in \mathbb{R}^K$ with certain
statistical properties. Considering the extensive literature available for the sparse decomposition problem after the
early work in [10], [?], [?], [?], [26], surprisingly little work has been dedicated to theoretical dictionary learning
so far. There exist several dictionary learning algorithms (see e.g. [11], [?], [?], [15]), but only recently have people
started to consider the theoretical aspects of the problem as well. The origins of research into what is now called
dictionary learning can be found in the field of Independent Component Analysis (ICA) [?], [?]. There, many
identifiability results are available, which, however, rely on asymptotic statistical properties under statistical
independence and non-Gaussianity assumptions.

In contrast, Georgiev, Theis and Cichocki [13], as well as Aharon, Elad and Bruckstein [?], described more
geometric identifiability conditions on the sparse coefficients of training data in an ideal (overcomplete) dictionary.
Yet, for these conditions to hold, the size $N$ of the training set seems to be required to grow exponentially fast
with the number of atoms $K$, and the provably good identification algorithms are combinatorial. Moreover, the
algorithms and the identifiability analysis are not robust to 'outliers', i.e., training samples $y_n$ where $x_n$ fails to
be sufficiently sparse. For applications, on the other hand, we are concerned with relatively large-dimensional data
(e.g. $d = 30$, or even $d = 1000$) but limited availability of training data ($N$ is not much larger than, say, $1000 \cdot d$)
as well as limited computational resources.

In this article, we study the possibility of designing provably good, non-combinatorial dictionary learning algorithms
that are robust to outliers and to the limited availability of training samples. Inspired by recent proofs of good
properties of $\ell_1$-minimisation for sparse signal decomposition with a given dictionary, we investigate the properties
of $\ell_1$-based dictionary learning [29], [23]. Our ultimate goal, described in detail in Section II, is to characterise
properties that a set of training samples $y_n$, $1 \le n \le N$, should satisfy to guarantee that an ideal dictionary is the
only local minimum of the $\ell_1$-criterion, opening up the possibility of replacing combinatorial learning algorithms
with efficient numerical descent techniques. As a first step, we investigate conditions under which an ideal dictionary
is a local minimum of the $\ell_1$-criterion.

Main results. First, we describe the proposed setting in Section II and characterise the local minima of the $\ell_1$-cost
function in Section III. We discuss the geometrical interpretation of this characterisation in Section IV. Then, using
concentration of measure, we prove in Section V the perhaps surprising result that when

$N \ge C K \log K$,

if the samples $x_n$, $1 \le n \le N$, are a typical draw from a Bernoulli-Gaussian random distribution (which can
generate a large proportion of outliers), then any sufficiently incoherent basis matrix $\Phi$, i.e. $K = d$, is a local
minimum of the cost function and is therefore 'locally identifiable'. The constant $C$ depends on a parameter of the
Bernoulli-Gaussian distribution which drives the sparsity of the training set.

This number of training samples is surprisingly small considering that $N$ training samples provide
$N \times K \ge C K^2 \log K$ real parameters, while the basis matrix $\Phi$ is essentially parameterised by $O(K^2)$
independent real parameters.

In the considered matrix identification setting, it should be noted that the $\ell_1$-criterion is not a convex cost
function. It admits several local minima, hence local identifiability only implies that, given a good initialisation,
numerical optimisation schemes performing the $\ell_1$-optimisation will recover the desired matrix $\Phi$. However,
empirical experiments in low dimension ($d = 2$), shown in Section VI, indicate that for typical draws of
Bernoulli-Gaussian training samples $x_n$, the matrix is in fact the only local minimum of the criterion (up to natural
indeterminacies of the problem such as column permutation). If this empirical observation could be turned into a
theorem for general dimension $K$ under the Bernoulli-Gaussian sparse model, this would imply that typically:
a) $\ell_1$-minimisation is a good identification principle; b) any decent $\ell_1$-descent algorithm is a good
identification algorithm.

II. Setting 

In the vector space $\mathcal{H} = \mathbb{R}^d$ of $d$-dimensional signals, a dictionary is a collection of $K \ge d$ vectors
$\varphi_k$, $1 \le k \le K$, and it is said to be complete if its columns span the whole space. Alternatively, a dictionary
can be seen as a $d \times K$ matrix $\Phi$. For a given signal $y \in \mathcal{H}$, the sparse representation problem consists
of finding a representation $y = \Phi \cdot x$ where $x \in \mathbb{R}^K$ is a 'sparse' vector, i.e. with few significantly
large coefficients and most of its coefficients negligible.



A. Sparse Representation by $\ell_1$-Minimisation, with a Known Dictionary

For a given dictionary, selecting an 'ideal' sparse representation of some data vector $y \in \mathcal{H}$ amounts to solving the
problem

$\min_x \|x\|_0$, such that $\Phi x = y$ (1)

where the $\ell_0$ pseudo-norm $\|x\|_0$ counts the number of nonzero entries in the vector $x$. However, being nonconvex
and nonsmooth, (1) is hard to solve and has indeed been shown to be an NP-hard problem [?], [18]. As a result,
people turned to suboptimal strategies like greedy algorithms or the Basis Pursuit principle. There the problem
above is replaced by its convex relaxation

$\min_x \|x\|_1$, such that $\Phi x = y$. (2)

The good news is that when $y$ admits a sufficiently sparse representation, the solution of the relaxed problem
coincides with the solution of the original one, compare [14], [9], [?], [26].

B. Dictionary Learning from a Collection of Training Samples 

A related problem is that of finding the dictionary that will fit a class of signals, in the sense that it will provide
sparse representations for all signals of the class. The first idea is to find the dictionary allowing representations
with the most zero coefficients, i.e. given $N$ signals $y_n \in \mathcal{H}$, $1 \le n \le N$, and a candidate dictionary $\Phi$,
one can measure the global sparsity as

$\sum_{n=1}^{N} \min_{x_n} \|x_n\|_0$, such that $\Phi x_n = y_n$, $\forall n$.

Collecting all signals $y_n$ (considered as column vectors in $\mathbb{R}^d$) into a $d \times N$ matrix $Y$ and all coefficients
$x_n$ (considered as column vectors in $\mathbb{R}^K$) into a $K \times N$ matrix $X$, the fit between a dictionary $\Phi$ and the
training signals $Y$ can be measured by the cost function

$C_0(\Phi|Y) := \min_{X | \Phi X = Y} \|X\|_0$,

where $\|X\|_0 := \sum_n \|x_n\|_0$ counts the total number of nonzero entries in the $K \times N$ matrix $X$. Thus to get the
dictionary providing the most zero coefficients out of a prescribed collection $\mathcal{D}$ of admissible dictionaries, we
should consider the criterion

$\min_{\Phi \in \mathcal{D}} C_0(\Phi|Y)$. (P0)

The problem is that already finding the representation with the fewest nonzero coefficients for one signal in a
given dictionary is NP-hard, which makes trying to solve (P0) a daunting task indeed. Fortunately, the problem
above is not only daunting but also rather uninteresting, since it is not stable with respect to noise or suited to
handle signals that are only compressible. Thus the idea of learning a dictionary via $\ell_1$-minimisation is motivated
on the one hand by the goal of having a criterion that takes into account that the signals might be noisy or only
compressible, and on the other by the success of the Basis Pursuit principle for finding sparse representations. There
the $\ell_0$ pseudo-norm was replaced with the $\ell_1$-norm, which also promotes sparsity but is convex and continuous.
The same strategy can be applied to the dictionary learning problem and the $\ell_0$-cost function can be replaced with
the $\ell_1$-cost function

$C_1(\Phi|Y) := \min_{X | \Phi X = Y} \|X\|_1$, (3)

where $\|X\|_1 := \sum_n \|x_n\|_1$. Several authors, [29], [?], [?], [?], [?], [?], [?], have proposed to consider the
corresponding minimisation problem

$\min_{\Phi \in \mathcal{D}} C_1(\Phi|Y)$. (P1)

Unlike for the sparse representation problem, where this change meant a convex relaxation, the dictionary learning
problem (P1) is still not convex and cannot be immediately addressed with generic convex programming algorithms.$^1$
However, it seems better behaved than the original problem (P0) because of the continuity of the criterion with
respect to increasing amounts of noise, which makes it more amenable to numerical implementation.

Looking at the problem above, we see that in order to solve it we still need to define $\mathcal{D}$, the set of admissible
dictionaries.

C. Constraints on the Dictionary 

Several families of admissible dictionaries can be considered, such as discrete libraries of orthonormal bases (wavelet
packets or cosine packets, for which fast dictionary selection is possible using tree-based searches [?]). Here we
focus on the 'non parametric' learning problem where the full $d \times K$ matrix $\Phi$ has to be learned. Since the value
of the criterion in (P1) can always be decreased by jointly replacing $\Phi$ and $X$ with $\alpha\Phi$ and $X/\alpha$,
$\alpha > 1$, a scaling constraint is necessary and a common approach is to only search for the optimum of (P1) within a
bounded domain $\mathcal{D}$.

We propose to concentrate on inequality constraints of the form$^2$ $\max_k \|\varphi_k\|_2 \le C$. Because of the
homogeneity of the criterion with respect to scaling, we can assume without loss of generality that $C = 1$. We also
let the reader check that the optimum of (P1) with the considered inequality constraints is indeed achieved when
there is equality, see also [16], [28]. Hence we define the following constraint manifold

$\mathcal{D} := \{\Phi, \forall k, \|\varphi_k\|_2 = 1\}$. (4)
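The role of the scaling constraint can be checked numerically. The following is a minimal sketch in plain Python; the helper names `matmul`, `l1_norm` and `normalise_columns` and the numerical values are illustrative assumptions, not taken from the paper. Scaling the dictionary up by a factor $\alpha > 1$ and the coefficients down by the same factor leaves the product unchanged but lowers the $\ell_1$ cost, whereas rescaling each atom to unit norm (and compensating in the matching row of the coefficients) brings any factorisation onto the constraint set (4).

```python
import math

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def l1_norm(x):
    """Entrywise l1 norm ||X||_1 of a matrix."""
    return sum(abs(v) for row in x for v in row)

def normalise_columns(phi, x):
    """Rescale every atom to unit l2 norm and compensate in the matching row of X."""
    d, K = len(phi), len(phi[0])
    norms = [math.sqrt(sum(phi[i][k] ** 2 for i in range(d))) for k in range(K)]
    phi_n = [[phi[i][k] / norms[k] for k in range(K)] for i in range(d)]
    x_n = [[norms[k] * v for v in x[k]] for k in range(K)]
    return phi_n, x_n

# Illustrative data (hypothetical values): atoms of norm 5 and 2.
phi = [[3.0, 0.0], [4.0, 2.0]]
x = [[1.0, 0.0, -2.0], [0.0, 0.5, 1.0]]
alpha = 10.0

# Without a constraint, (alpha * Phi, X / alpha) fits Y equally well at lower cost.
phi_scaled = [[alpha * v for v in row] for row in phi]
x_scaled = [[v / alpha for v in row] for row in x]

# With unit-norm atoms the scaling freedom disappears.
phi_unit, x_unit = normalise_columns(phi, x)
```

Here `l1_norm(x_scaled)` is ten times smaller than `l1_norm(x)` although both pairs reproduce the same data, which is why the criterion only makes sense on a bounded set of dictionaries.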

Let us turn now to the special aspect of dictionary learning treated in this paper. 

D. Dictionary Recovery: the Identification Problem 

Several algorithms have been proposed which adopt an $\ell_1$-minimisation approach to learning a dictionary from
training data [11], [16], [23]. Their empirical behaviour has been explored, showing their ability to often recover
with good precision the underlying dictionary.

Here we are interested in the more theoretical problem of dictionary identification by $\ell_1$-minimisation: assuming
that the data $Y$ were generated from an 'ideal' dictionary $\Phi_0 \in \mathcal{D}$ and 'ideal' coefficients $X_0$ as
$Y = \Phi_0 X_0$, we want to determine conditions on $X_0$ (and to a lesser extent on $\Phi_0$) such that the minimisation
of (P1) recovers $\Phi_0$. Our objective is therefore similar in spirit to previous work on dictionary recovery [13], which
studied the uniqueness of overcomplete dictionaries for sparse component analysis. The main difference here is that we
specify in advance which optimisation criterion we want to use to recover the dictionary ($\ell_1$-minimisation) and
attempt to express conditions on a matrix $X_0$ to guarantee that this method will successfully recover a given class
of dictionaries.

1 The problem investigated here should not be confused with the problem of sparse channel estimation considered by
Pfander, Rauhut and Tanner in [20]. There the goal is to identify a transmission channel $\Phi$ by an appropriate choice of
input sequence $x$ and the observation of $y = \Phi x$. The approach is to model $\Phi = \sum_\ell \alpha_\ell \Phi_\ell$ with sparse
coefficients $\alpha$ in a known dictionary of "atomic channels", and to solve the convex problem $\min \|\alpha\|_1$ subject
to $y = \sum_\ell \alpha_\ell (\Phi_\ell x)$. Here, we do not have the freedom to choose $x$ nor do we know the channel dictionary,
and the problem we consider is no longer convex.

2 Other constraints which replace the norm $\|\varphi_k\|_2$ with, e.g., a norm $\|\varphi_k\|_1$, would also be interesting to study
when it is desirable to obtain sparse atoms and not only sparse coefficients.

Permutation and sign ambiguity. The first problem we face consists of the ambiguities, which have been well
known since the development of ICA. Because of the normalisation constraint we are assuming on the dictionary, the
usual scaling ambiguity is avoided, but there remains a permutation and a sign ambiguity: for any permutation matrix
$P$ and any diagonal matrix $D$ with unit-modulus diagonal entries ($\pm 1$) we have $\Phi X = (\Phi P^{-1} D^{-1})(D P X)$.
Hence Problem (P1) has not just one but a whole equivalence class of minimisers, each of them corresponding to a
matching column resp. row permutation and sign change of $\Phi$ resp. $X$. Therefore, we have to relax our requirement
and only ask to find conditions such that minimising (P1) recovers $\Phi_0$ up to permutation and sign change. The
notation $\Phi \sim \Phi_0$ will indicate this indeterminacy, meaning that $\Phi = \Phi_0 P D$ for some permutation matrix $P$
and diagonal matrix $D$ with unit-modulus diagonal entries ($\pm 1$).
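The permutation and sign ambiguity is easy to verify on a toy example. The sketch below (plain Python; the function name `permute_and_flip` and the data are illustrative assumptions) permutes and flips the atoms of a dictionary, applies the matching permutation and flips to the rows of the coefficient matrix, and leaves both the product and the $\ell_1$ cost unchanged.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def l1_norm(x):
    """Entrywise l1 norm ||X||_1 of a matrix."""
    return sum(abs(v) for row in x for v in row)

def permute_and_flip(phi, x, perm, signs):
    """New atom j is signs[j] * (old atom perm[j]); rows of X follow the same rule."""
    K = len(perm)
    phi_new = [[signs[j] * row[perm[j]] for j in range(K)] for row in phi]
    x_new = [[signs[j] * v for v in x[perm[j]]] for j in range(K)]
    return phi_new, x_new

# Hypothetical example with d = 2, K = 3 and unit-norm atoms.
phi = [[1.0, 0.6, 0.0],
       [0.0, 0.8, 1.0]]
x = [[1.0, 0.0, -2.0, 0.0],
     [0.0, 3.0, 0.0, 0.5],
     [-1.0, 0.0, 0.0, 4.0]]

phi_new, x_new = permute_and_flip(phi, x, perm=[2, 0, 1], signs=[-1.0, 1.0, -1.0])
```

Since the sign factors square to one, `matmul(phi_new, x_new)` reproduces `matmul(phi, x)`, so no criterion based on the data alone can tell the two factorisations apart.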

Global identifiability vs local identifiability. Ideally, we would like to characterise coefficient matrices $X_0$ such
that, for any $\Phi_0 \in \mathcal{D}$ (or at least for a reasonable subset of $\mathcal{D}$ such as, for instance, 'incoherent'
dictionaries), the global minima of

$\min_{\Phi \in \mathcal{D}} C_1(\Phi | \Phi_0 X_0)$ (5)

can only be found at $\Phi \sim \Phi_0$.

An even more ambitious objective would be to characterise coefficient matrices such that the local minima of (5)
can only be found at $\Phi \sim \Phi_0$, which would guarantee that numerical optimisation algorithms cannot be trapped
in spurious local minima, and would converge independently of their initialisation. This objective raises two
complementary questions:

a) Local identifiability: Which conditions on $X_0$ (and $\Phi_0$) guarantee that $\Phi_0$ is a local minimum of the
$\ell_1$-cost function?

b) Uniqueness: Which conditions guarantee that, when $\Phi$ is a local minimum of the $\ell_1$-cost function, it must
match $\Phi_0$ up to column permutation and sign change?

In this paper we concentrate on the first question. The characterisation of local minima of the $\ell_1$-criterion that we
carry out in Section III will certainly serve to address the second question in future work.

Ideally sparse training samples vs non-sparse outliers. In contrast to previous theoretical work on dictionary
uniqueness [13], [?], we wish to determine identification conditions that do not rely on the unrealistic assumption
that each training sample is ideally sparse. As a first step to deal with training data which may contain training
samples $y_n = \Phi_0 x_n$ with non-sparse coefficients $x_n$, we consider in Section V a Bernoulli-Gaussian model and
show that, when the number of training samples drawn according to this model is sufficiently high, incoherent bases
are associated to local minima of $C_1$.

Figure 1 illustrates a typical cloud of $N = 1000$ points $y_n = \Phi_0 x_n \in \mathbb{R}^d$, $d = 2$, where $x_n$ was generated
according to this Bernoulli-Gaussian model with parameter $p = 0.7$ (cf. Section V). Here the dictionary is a basis
made of two atoms $\varphi_k^\star = (\cos\theta_k^\star, \sin\theta_k^\star)^T \in \mathbb{R}^2$, $k = 0, 1$, characterised by their
angle $\theta_k^\star$, and its coherence is
$\mu = |\langle\varphi_0^\star, \varphi_1^\star\rangle| = |\cos(\theta_1^\star - \theta_0^\star)| = 0.05$. One can observe that, while
many training samples are perfectly aligned with the lines generated by the two atoms of the dictionary, there is also
a substantial proportion of "outliers" that do not have a sparse representation in the considered dictionary.
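A point cloud of this kind can be simulated in a few lines. The sketch below is hedged: the precise Bernoulli-Gaussian model is only specified in Section V, so the code assumes the usual reading in which each coefficient is, independently, nonzero with probability $p$ and then standard normal. The function names, the seed and the reference angle are arbitrary illustrative choices.

```python
import math
import random

def bernoulli_gaussian(K, N, p, rng):
    """K x N coefficients: each entry is nonzero with probability p, then N(0, 1)."""
    return [[rng.gauss(0.0, 1.0) if rng.random() < p else 0.0 for _ in range(N)]
            for _ in range(K)]

def basis_from_angles(theta0, theta1):
    """2 x 2 basis whose columns are the unit atoms (cos theta_k, sin theta_k)."""
    return [[math.cos(theta0), math.cos(theta1)],
            [math.sin(theta0), math.sin(theta1)]]

rng = random.Random(0)                 # fixed seed, arbitrary choice
theta0 = 0.3                           # arbitrary reference angle
theta1 = theta0 + math.acos(0.05)      # so that |cos(theta1 - theta0)| = 0.05
phi0 = basis_from_angles(theta0, theta1)
x0 = bernoulli_gaussian(K=2, N=1000, p=0.7, rng=rng)

# Training samples y_n = Phi_0 x_n, stored as (first coordinate, second coordinate).
samples = [(phi0[0][0] * a + phi0[0][1] * b, phi0[1][0] * a + phi0[1][1] * b)
           for a, b in zip(x0[0], x0[1])]

# Coherence of the basis and share of samples using both atoms ("outliers").
coherence = abs(phi0[0][0] * phi0[0][1] + phi0[1][0] * phi0[1][1])
outlier_fraction = sum(1 for a, b in zip(x0[0], x0[1])
                       if a != 0.0 and b != 0.0) / len(samples)
```

Under the stated assumption, a sample uses both atoms with probability $p^2 = 0.49$, so roughly half of the cloud consists of non-sparse outliers, while the remaining nonzero samples lie exactly on one of the two lines spanned by the atoms.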



[Figure 1: scatter plot titled "N = 1000 Bernoulli-Gaussian training samples".]

Fig. 1. A cloud of $N = 1000$ training samples in $\mathbb{R}^2$. Each point is a column $y_n$ of the matrix $Y = \Phi_0 X_0$,
where $X_0$ was generated using the Bernoulli-Gaussian model of Section V with $p = 0.7$.

For the same point cloud shown on Figure 1, Figure 2 shows the value of the $\ell_1$-cost $C_1(\Phi|Y)$ as a function
of the angles $\theta_0, \theta_1$ which parameterise the dictionary $\Phi = [\varphi_0, \varphi_1]$, where
$\varphi_k = (\cos\theta_k, \sin\theta_k)^T \in \mathbb{R}^2$. Writing $\theta_0^\star, \theta_1^\star$ for the angles of the ideal atoms, one
can observe that there are indeed local minima where they were expected to be located, i.e., at
$(\theta_0, \theta_1) = (\theta_0^\star, \theta_1^\star)$ and $(\theta_0, \theta_1) = (\theta_1^\star, \theta_0^\star)$, which are associated to the ideal
dictionary and its permuted version (the sign ambiguity is avoided by restricting the angles to the interval
$[0, \pi]$). Moreover, despite the presence of many outliers in the training data, there is no other spurious local
minimum. As a result, the global minima are found where they were expected, and none is missed.
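An experiment of this type can be imitated with a brute-force grid search. The sketch below is not the authors' code: it assumes the Bernoulli-Gaussian model draws each coefficient as nonzero with probability $p$ and then standard normal, it places the ideal angles on a 3-degree grid (30 and 117 degrees, giving a coherence of about 0.052 rather than exactly 0.05), and it uses the fact that for a basis the cost reduces to $C_1(\Phi|Y) = \|\Phi^{-1} Y\|_1$. All names are illustrative.

```python
import math
import random

def l1_cost_2d(theta0, theta1, samples):
    """C_1(Phi|Y) = ||Phi^{-1} Y||_1 for the 2 x 2 basis with atom angles theta0, theta1."""
    a, c = math.cos(theta0), math.sin(theta0)   # first atom (a, c)
    b, d = math.cos(theta1), math.sin(theta1)   # second atom (b, d)
    det = a * d - b * c                         # equals sin(theta1 - theta0)
    if abs(det) < 1e-9:
        return math.inf                         # degenerate (parallel atoms)
    total = 0.0
    for y0, y1 in samples:
        # Phi^{-1} y = (1 / det) * [[d, -b], [-c, a]] y
        total += abs(d * y0 - b * y1) + abs(a * y1 - c * y0)
    return total / abs(det)

def grid_minimiser(samples, step_deg=3):
    """Exhaustive search over 0 <= theta0 < theta1 < 180 degrees.

    Restricting to [0, 180) removes the sign ambiguity and theta0 < theta1
    removes the permutation ambiguity."""
    best_cost, best_angles = math.inf, None
    for i in range(0, 180, step_deg):
        for j in range(i + step_deg, 180, step_deg):
            cost = l1_cost_2d(math.radians(i), math.radians(j), samples)
            if cost < best_cost:
                best_cost, best_angles = cost, (i, j)
    return best_cost, best_angles

rng = random.Random(1)                 # fixed seed, arbitrary choice
true_deg = (30, 117)                   # hypothetical ideal angles, on the grid
t0, t1 = (math.radians(v) for v in true_deg)
samples = []
for _ in range(1000):
    x0 = rng.gauss(0.0, 1.0) if rng.random() < 0.7 else 0.0
    x1 = rng.gauss(0.0, 1.0) if rng.random() < 0.7 else 0.0
    samples.append((x0 * math.cos(t0) + x1 * math.cos(t1),
                    x0 * math.sin(t0) + x1 * math.sin(t1)))

best_cost, best_angles = grid_minimiser(samples)
```

On such a draw one would expect `best_angles` to coincide with, or lie next to, the ideal pair `(30, 117)`. A grid search only inspects the global minimum on the grid, so it illustrates, but does not establish, the absence of spurious local minima.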

For the particular case $K = d = 2$, we ran a Monte-Carlo simulation where we varied the coherence $\mu$ of the
dictionary and the Bernoulli-Gaussian parameter $p$ - which is associated to the typical sparsity of the generated
training samples - repeating a hundred times the random draw of $X_0$. Figure 3 displays the obtained results, in
terms of empirical phase transitions. For small $p$ (associated to training data with many sparse samples), the black
regions indicate that the probability of missing an expected local minimum (as well as that of finding a spurious one,
or an erroneous global minimum) is very low, even if the coherence of the dictionary is very high. For larger values
of $p$, associated to training data with more non-sparse outliers in the training set, the probability of error remains
very small provided that the dictionary is sufficiently incoherent. An empirical rule of thumb seems to be that, for
small $p$, if $\mu < 1 - p$ then the probability of learning errors is very small, provided that the number of training
samples is sufficiently large.

Fig. 2. The value of the cost $C_1(\Phi|Y)$ as a function of the angles $\theta_0, \theta_1$ which parameterise the dictionary
$\Phi = [\varphi_0, \varphi_1]$, $\varphi_k = (\cos\theta_k, \sin\theta_k)^T \in \mathbb{R}^2$. Because the cost function grows to infinity
when $\theta_1 - \theta_0$ is close to zero, we displayed $-1/C_1(\Phi|Y)$ instead, which has the same minima.

[Figure 3: three panels titled "Missed local minimum", "Spurious local minimum" and "Wrong global minimum".]

Fig. 3. Observed empirical phase transitions for dictionary identification by $\ell_1$-minimisation, when $K = d = 2$ and
$N$ is large. Grey level indicates observed probability of error, from black (zero) to white (one).

Fully characterising such phase transitions for learning overcomplete dictionaries is a difficult task, for several
difficulties arise at once, some due to the possible overcompleteness and non-orthogonality of the dictionary, others
due to the difficulty of globally characterising the optima of a globally nonconvex problem which we know admits
exponentially many solutions because of the permutation and sign indeterminacies. The analytic and probabilistic
machinery we set up in the next sections provides tools to progress significantly towards this ambitious goal.
In particular, even though the considered Bernoulli-Gaussian model may seem simplistic (it does not account for
"compressible" training samples, where $x_n$ is not exactly sparse but only well approximated with few terms; neither
does it account for noise $y_n = \Phi x_n + \varepsilon_n$), we believe it is a good warm-up tool to understand: a) under which
conditions the $\ell_1$-criterion can be robust to non-sparse outliers; and b) whether dictionary identification is feasible
using a limited number of samples. As we will see, fortunately, the answer to both questions is positive (but
mathematically somewhat technical), under proper assumptions.

March 1, 2010 DRAFT

III. Local Minima 

Instead of directly characterising the local minima of the original problem (P1) we consider the related problem

$\min_{(\Phi, X) | \Phi \in \mathcal{D}, \Phi X = Y} \|X\|_1$. (P1')

It is intimately connected to the initial problem (P1).

Remark 3.1: We let the reader check the following facts.

• When $\Phi$ is a basis ($K = d$), the problem (P1') is fully equivalent to the problem (P1), in the sense that if
$\Phi$ is a local (resp. global) minimum of (P1), then the pair $(\Phi, \Phi^{-1} Y)$ is a local (resp. global) minimum of
(P1'), and vice-versa.

• When $\Phi$ is overcomplete ($K > d$),

- if $\Phi$ is a local (resp. global) minimum of the original problem (P1), then there is a coefficient matrix $X$
such that the pair $(\Phi, X)$ is a local (resp. global) minimum of (P1');

- if $(\Phi, X)$ is a global minimum of (P1'), then $\Phi$ is a global minimum of (P1).

Just as in the representation problem (2), where the $\ell_1$-cost is not a smooth function of $x$ as soon as $x$ has at least
one zero entry, the cost in Equation (P1') is not a smooth function of $(\Phi, X)$ whenever $X$ has at least one zero
entry. Therefore, one cannot fully characterise the local minima of the cost function (P1') as a subset of the zeros
of a 'gradient' of the $\ell_1$-cost function with respect to $(\Phi, X)$, for this gradient is not even well defined in a
standard sense.$^3$

Here, on the contrary, we want to understand the effect of the non-smooth behaviour of the cost function, and to
exploit it to characterise its local minima. For that we will develop a replacement for the 'gradient' which accounts
for the fact that the $\ell_1$-cost function indeed admits one-sided directional derivatives everywhere. To keep the flow
of the paper, we postpone most proofs and technical lemmata to the appendix.
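The one-sided directional derivative of the $\ell_1$-norm has a simple closed form, which the sketch below compares with a finite difference (plain Python; the function names and the test vectors are illustrative). At a point $x$ and in a direction $h$, the nonzero entries of $x$ contribute $\mathrm{sign}(x_i) h_i$ while the zero entries contribute $|h_i|$, and it is this second term that prevents the derivative from being linear in $h$.

```python
def l1(x):
    """l1 norm of a vector."""
    return sum(abs(v) for v in x)

def one_sided_derivative(x, h):
    """Closed form of lim_{t -> 0+} (||x + t h||_1 - ||x||_1) / t."""
    total = 0.0
    for xi, hi in zip(x, h):
        if xi > 0:
            total += hi          # smooth part, slope +1
        elif xi < 0:
            total -= hi          # smooth part, slope -1
        else:
            total += abs(hi)     # kink at zero: cost grows in both directions
    return total

def finite_difference(x, h, t=1e-6):
    """Numerical one-sided difference quotient with a small step t > 0."""
    return (l1([xi + t * hi for xi, hi in zip(x, h)]) - l1(x)) / t

x = [2.0, 0.0, -1.0]
h = [1.0, -3.0, 0.5]
minus_h = [-v for v in h]

forward = one_sided_derivative(x, h)          # 1 + 3 - 0.5 = 3.5
backward = one_sided_derivative(x, minus_h)   # -1 + 3 + 0.5 = 2.5
```

For a differentiable function the two values would be opposite. Here they are 3.5 and 2.5, so no gradient exists at `x`, yet both one-sided derivatives are well defined, which is the structure the characterisation of local minima relies on.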

A. Basic Notations 

We denote by $\bar\Lambda_n$ the set indexing the zero entries of the $n$-th column $x_n$ of $X_0$, and
$\bar\Lambda = \{(n, k), 1 \le n \le N, k \in \bar\Lambda_n\}$ the set indexing all zero entries in $X_0$. The notation$^4$ $x^k$ is
for the $k$-th row of $X_0$, and $\bar\Lambda^k$ is the set indexing the columns with a zero entry in $x^k$.

For any $K \times N$ matrix $A$ and index set $\Omega \subset [\![1, K]\!] \times [\![1, N]\!]$, the notation $A_\Omega$ will refer
interchangeably either to the vector $(A_{k,n})_{(k,n) \in \Omega}$ or to the $K \times N$ matrix which matches $A$ on $\Omega$ and
is zero elsewhere. The cardinality of $\Omega$ is denoted $|\Omega|$.

3 Even the notion of Gateaux derivatives is not applicable to this cost function, which may be a reason why a
standard numerical approach [29] is to smooth it.

4 We will generally distinguish column vectors from row vectors using subscript vs superscript indices.



B. Block Decomposition of the Considered Matrices 



In Appendix B we provide a full characterisation of local minima (Lemma B.3), which is sharp but somewhat
abstract. To make its meaning more explicit, it is useful to consider the following block decompositions of the
coefficient matrix $X_0$ (see Figure 4):

• $x^k$ is the $k$-th row of $X_0$;

• $\Lambda^k$ is the set indexing the nonzero entries of $x^k$ and $\bar\Lambda^k$ the set indexing its zero entries;

• $s^k$ is the row vector $\mathrm{sign}(x^k)_{\Lambda^k}$;

• $X_k$ (resp. $\bar X_k$) is the matrix obtained by removing the $k$-th row of $X_0$ and keeping only the columns indexed
by $\Lambda^k$ (resp. $\bar\Lambda^k$).

Fig. 4. Block decomposition of the matrix $X_0$ with respect to a given row $x^k$. Without loss of generality, the columns
of $X_0$ have been permuted so that the first $|\Lambda^k|$ columns hold the nonzero entries of $x^k$ while the last
$|\bar\Lambda^k|$ hold its zero entries.

We also define $m_k$, the $k$-th column of the off-diagonal part of the Gram matrix $M_0 = \Phi_0^T \Phi_0 - I$, and

$\tilde m_k := (\langle \varphi_\ell, \varphi_k \rangle)_{1 \le \ell \le K, \ell \ne k}$ (6)

the $k$-th column of this matrix without the zero entry corresponding to the diagonal. Finally, we consider the vectors

$u_k := X_k (s^k)^T - \mathrm{diag}(\|x^\ell\|_1)_{1 \le \ell \le K, \ell \ne k} \cdot \tilde m_k$. (7)
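To make the block notation concrete, the sketch below builds the blocks attached to a row of a small coefficient matrix and evaluates the vector of Eq. (7). It is plain Python under illustrative assumptions: matrices are lists of rows, indices start at 0 instead of 1, and the function names and numerical values are hypothetical.

```python
def blocks(X0, k):
    """Return (X_k, Xbar_k, s_k) for row k of X0, removing row k from both blocks."""
    row = X0[k]
    nonzero = [n for n, v in enumerate(row) if v != 0]   # columns where x^k != 0
    zero = [n for n, v in enumerate(row) if v == 0]      # columns where x^k == 0
    others = [i for i in range(len(X0)) if i != k]
    X_k = [[X0[i][n] for n in nonzero] for i in others]
    Xbar_k = [[X0[i][n] for n in zero] for i in others]
    s_k = [1.0 if row[n] > 0 else -1.0 for n in nonzero]
    return X_k, Xbar_k, s_k

def v_vector(X0, k):
    """First summand of Eq. (7): X_k applied to the sign pattern of row k."""
    X_k, _, s_k = blocks(X0, k)
    return [sum(a * s for a, s in zip(r, s_k)) for r in X_k]

def m_tilde(Phi0, k):
    """Inner products of atom k with all the other atoms, as in Eq. (6)."""
    d, K = len(Phi0), len(Phi0[0])
    return [sum(Phi0[r][i] * Phi0[r][k] for r in range(d)) for i in range(K) if i != k]

def u_vector(Phi0, X0, k):
    """The vector u_k of Eq. (7)."""
    v = v_vector(X0, k)
    m = m_tilde(Phi0, k)
    row_norms = [sum(abs(e) for e in X0[i]) for i in range(len(X0)) if i != k]
    return [vi - ni * mi for vi, ni, mi in zip(v, row_norms, m)]

# Hypothetical coefficients with K = 3 rows and N = 4 columns.
X0 = [[1.0, 0.0, -2.0, 0.0],
      [0.5, 3.0, 0.0, 0.0],
      [0.0, -1.0, 1.0, 4.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
coherent = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.6], [0.0, 0.0, 0.8]]   # unit-norm atoms

u0_orthonormal = u_vector(identity, X0, 0)   # Gram matrix is I, so u_0 = v_0
u1_coherent = u_vector(coherent, X0, 1)      # atoms 1 and 2 have inner product 0.6
```

For the orthonormal basis the second summand vanishes and `u0_orthonormal` equals `v_vector(X0, 0)`, namely `[0.5, -1.0]`. For the coherent basis the correction term shifts the second coordinate of $u_1$ from $-1$ to about $-4.6$.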

C. A Necessary Condition, and a Sufficient Condition 

Equipped with these notations, we can now state the following necessary condition. 

Theorem 3.1 (Necessary condition): Consider a complete dictionary $\Phi_0 \in \mathcal{D}$ and a coefficient matrix $X_0$ such
that $\Phi_0 X_0 = Y$. Assume that $X_0$ is the minimum $\ell_1$-norm representation of $Y$. With the above defined notations:

a) if $(\Phi_0, X_0)$ is a local minimum of (P1'); or

b) if $\Phi_0$ is a global minimum of (P1);

then we have

$\max_k \sup_{z \ne 0} \frac{\langle u_k, z \rangle}{\|\bar X_k^T z\|_1} \le 1$. (NC)

As a matter of fact, condition (NC) is almost sufficient to ensure that we have a local minimum, at least in the
restricted case where $\Phi_0$ is a basis, i.e., $K = d$.

Theorem 3.2 (Sufficient condition, case of a basis, $K = d$): Consider a basis matrix $\Phi_0$ with unit columns and
a coefficient matrix $X_0$ such that $\Phi_0 X_0 = Y$. Assume that

$\max_k \sup_{z \ne 0} \frac{\langle u_k, z \rangle}{\|\bar X_k^T z\|_1} < 1$. (SC)

Then $(\Phi_0, X_0)$ is a strict local minimum of (P1').

It remains an open question whether this type of condition is also sufficient in the case of overcomplete
dictionaries. We conjecture that the answer is positive when the constant 1 on the right hand side of (SC) is
replaced by a sufficiently smaller value, under some additional assumptions relating the sparsity of $X_0$ and the
null space of $\Phi_0$. This will be the object of further studies. For the time being, we wish to obtain a more explicit
understanding of the meaning of conditions (NC)-(SC), and to characterise nontrivial collections $X_0$ for which they
are satisfied for reasonable dictionaries. In the next section we discuss the geometric interpretation of (NC)-(SC).

IV. Geometric interpretation 



Using a duality argument (Lemma B.5 in the Appendix) we first observe that for any vector $v \in \mathbb{R}^{K-1}$, we have

$\sup_{z \ne 0} \frac{\langle v, z \rangle}{\|\bar X_k^T z\|_1} \le 1$ (8)

if, and only if, there exists a vector $d$ with $\|d\|_\infty \le 1$ such that $v = \bar X_k d$. In other words, condition (8)
holds if the vector $v \in \mathbb{R}^{K-1}$ belongs to the convex polytope obtained by projecting the high-dimensional unit
hypercube$^5$ $Q := \{d, \|d\|_\infty \le 1\}$ using the matrix $\bar X_k$.
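When $K = 3$ the polytope is a polygon in the plane and membership can be tested exactly without linear programming. The sketch below is an illustration rather than the paper's method. It relies on the fact that the image of a hypercube under a linear map is a zonotope, whose edges in the plane are parallel to the columns of the matrix, so it suffices to test the inequality of condition (8) for the directions $z$ orthogonal to each column. It assumes the columns span the plane, and the function name and test points are hypothetical.

```python
def in_projected_cube_2d(v, columns, tol=1e-12):
    """Test whether v lies in {A d : ||d||_inf <= 1}, where A has the given 2-D columns.

    Assumes the columns span the plane (non-degenerate polygon)."""
    for gx, gy in columns:
        nx, ny = -gy, gx                     # normal to the edge parallel to this column
        if nx == 0 and ny == 0:
            continue                         # a zero column contributes no edge
        # Support function of the polygon in direction n: ||A^T n||_1.
        support = sum(abs(ax * nx + ay * ny) for ax, ay in columns)
        if abs(v[0] * nx + v[1] * ny) > support + tol:
            return False                     # separated by the edge pair with normal n
    return True

# Columns (3, -1) and (0, 4): the polygon is a parallelogram with vertices
# (3, 3), (3, -5), (-3, -3) and (-3, 5).
columns = [(3.0, -1.0), (0.0, 4.0)]

inside = in_projected_cube_2d((0.5, -1.0), columns)
vertex = in_projected_cube_2d((3.0, 3.0), columns)
outside = in_projected_cube_2d((4.0, 0.0), columns)
```

Here `inside` and `vertex` are `True` and `outside` is `False`, since the first coordinate of any point of this parallelogram is at most 3. In higher dimension the same question becomes a linear feasibility problem in $d$.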

The second observation is that the first summand in the definition of the vector $u_k$ (cf. Eq. (7)), which is the
vector

$v_k := X_k (s^k)^T$, (9)

is a simple weighted sum of columns of $X_k$. Indeed, denoting $X_k^+$ (resp. $X_k^-$) the matrix made of the columns of
$X_k$ for which $x_n(k)$ is positive (resp. negative), the vector $v_k$ is the difference between the sum of the columns of
$X_k^+$ and the sum of those of $X_k^-$.

5 We chose to denote the hypercube $Q$ while, technically, it depends on the considered dimension $|\bar\Lambda^k|$ and will
be denoted $Q^{|\bar\Lambda^k|}$ when needed.

A. Orthonormal Dictionaries

Assume for a moment that the reference dictionary $\Phi_0$ is an orthonormal basis. Then, we have $M_0 = 0$ and
therefore $\tilde m_k = 0$ and $u_k = v_k$ for all $k$. The necessary condition (NC) then simply reads: for each $k$, the vector
$v_k$ must lie within the convex polytope $\bar X_k Q$. This is illustrated on Figures 5 and 6 in dimension $K = 3$, so that
the vector $v_k$ as well as all the columns of $X_k$ and $\bar X_k$ live in $\mathbb{R}^2$. Both figures were obtained using training data
drawn according to the Bernoulli-Gaussian model described in Section V. Figure 5 corresponds to relatively sparse
data (the parameter of the Bernoulli-Gaussian model is $p = 0.5$) and we can observe that despite the relatively low
number of training samples ($N = 20$) the vector $v_k$ does belong to the polygon $\bar X_k Q$: the necessary condition (NC)
is satisfied for the considered index $k$, and on the same data we checked that it is also satisfied for the other two
indices. Since the vectors are indeed strictly inside the considered polygons, the sufficient condition (SC) is also
satisfied.

Fig. 5. Geometric depiction, when $K = 3$, of the condition (NC). The data was drawn according to the Bernoulli-Gaussian
model described in Section V with $p = 0.5$ and $N = 20$.

On the contrary, Figure 6 corresponds to data with many non-sparse outliers ($p = 0.9$) and one can observe that
despite the larger number of training samples ($N = 100$), the vector $v_k$ does not belong to the polygon $\bar X_k Q$: the
necessary condition (NC) is not satisfied.



B. Robustness to Dictionary Coherence 

One can observe on Figure 5 that the vector $v_k$ is well inside the convex polytope $\bar X_k Q$. If we choose some
$1 \le q \le \infty$, one way to quantify this fact is to say that $v_k$ has a small $\ell_q$-norm $\|v_k\|_q$ compared to the radius
of the largest $\ell_q$-ball that is included in $\bar X_k Q$. From the definition of the vector $u_k$ (cf. Eq. (7)), it follows
that if the vector

$\mathrm{diag}(\|x^\ell\|_1)_{1 \le \ell \le K, \ell \ne k} \cdot \tilde m_k$

also has a small $\ell_q$-norm (which is the case when $\Phi_0$ is not necessarily orthogonal but sufficiently "incoherent"),
then $u_k$ is close to $v_k$, hence also lies in the polytope $\bar X_k Q$. We then conclude that conditions (NC)-(SC) hold
true. In other words, these conditions are robust to a certain level of dictionary coherence provided that:

a) each polytope $\bar X_k Q$ contains a "large" $\ell_q$-ball;

b) each vector $v_k$ has "small" $\ell_q$-norm;

c) each row $x^k$ of $X_0$ has "small" $\ell_1$-norm.

Fig. 6. Geometric depiction, when $K = 3$, of the condition (NC). The data was drawn according to the Bernoulli-Gaussian
model described in Section V with $p = 0.9$ and $N = 100$.

Lemma B.6 in the appendix states that the radius of the largest $\ell_q$-ball included in all $\bar{X}_k Q$ is given by

$$\alpha_q(X_0) := \min_k \inf_{z \ne 0} \frac{\|\bar{X}_k^* z\|_1}{\|z\|_{q'}}, \tag{10}$$

where $1 \le q' \le \infty$ satisfies $1/q + 1/q' = 1$. We also define

$$\beta_q(X_0) := \max_k \|v_k\|_q, \tag{11}$$

$$\gamma(X_0) := \max_k \|x^k\|_1. \tag{12}$$

We can now state the following theorem.

Theorem 4.1: Consider $1 \le q \le \infty$ and a $K \times N$ matrix $X_0$. The conditions (NC)-(SC) are satisfied provided that the dictionary $\Phi_0 \in \mathcal{D}$ is "incoherent", in the sense that

$$\mu_q(\Phi_0) := \max_k \|\bar{m}_k\|_q < \frac{\alpha_q(X_0) - \beta_q(X_0)}{\gamma(X_0)}. \tag{13}$$

In particular, if $\Phi_0$ is an incoherent basis ($K = d$), then the optimisation problem (P1') with $Y := \Phi_0 X_0$ admits a strict local minimum at $(\Phi, X) = (\Phi_0, X_0)$.

Compared to Theorems 3.1 and 3.2, the above theorem decouples the assumptions on the coefficient matrix $X_0$ from those on the dictionary $\Phi_0$. This considerably simplifies the analysis since we now "only" need to estimate the three quantities $\alpha_q(X_0)$, $\beta_q(X_0)$ and $\gamma(X_0)$. While the last two quantities are explicit and easy to compute for a given $X_0$, $\alpha_q(X_0)$ is a bit more difficult to compute for a specific $X_0$. In Section V we show how to estimate its typical value when $X_0$ is drawn according to a Bernoulli-Gaussian model.
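To make the "explicit and easy to compute" quantities concrete, here is a minimal NumPy sketch. It assumes (Eq. (7) is not restated in this section) that $v_k = X_k (s^k)^*$ equals the $k$-th column of $X_0\,\mathrm{sign}(X_0)^*$ with its $k$-th entry removed, and that $\bar{m}_k$ is the $k$-th column of $M_0 = \Phi_0^*\Phi_0 - I$ with its zero diagonal entry removed. The function names `beta_gamma` and `mu_q` are illustrative, not part of the paper.

```python
import numpy as np

def beta_gamma(X0, q=2):
    """Return (beta_q(X0), gamma(X0)) under the assumptions stated above."""
    K = X0.shape[0]
    # G[l, k] = <x^l, sign(x^k)>; sign(x^k) vanishes off the support of row k.
    G = X0 @ np.sign(X0).T
    v_norms = [np.linalg.norm(np.delete(G[:, k], k), ord=q) for k in range(K)]
    gamma = np.abs(X0).sum(axis=1).max()      # largest l1-norm of a row
    return max(v_norms), gamma

def mu_q(Phi0, q=2):
    """Return max_k ||m_k||_q for a dictionary with unit-norm columns."""
    K = Phi0.shape[1]
    M0 = Phi0.T @ Phi0 - np.eye(K)            # zero-diagonal part of the Gram matrix
    return max(np.linalg.norm(np.delete(M0[:, k], k), ord=q) for k in range(K))
```

For an orthonormal basis `mu_q` returns zero, so condition (13) reduces to $\beta_q(X_0) < \alpha_q(X_0)$.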



C. Discussion: Choice of q. 



Notice that Theorem 4.1 involves a parameter $1 \le q \le \infty$. Depending on the choice of $q$, the resulting coherence condition may be either very restrictive on the dictionary or quite weak. As we illustrate below with a few examples, the nature of the training data can have a substantial influence on the "right" choice of $q$.



Fig. 7. Shape of the polytope $\bar{X}_k Q$, $K = 3$, $p = 0.1$ and $N = 2000$. The data was drawn according to the Bernoulli-Gaussian model described in Section V and is highly sparse. The shape is close to a cube.



1) Highly sparse training data: For a Bernoulli-Gaussian coefficient matrix $X_0$ associated to small $p$ (highly sparse data with few non-sparse outliers), as illustrated in Figure 7, the polytope $\bar{X}_k Q$ seems to be roughly shaped (when the number $N$ of training samples is large) as a cube in $\mathbb{R}^{K-1}$. Therefore, the radius of the largest included $\ell_q$-ball is almost independent of $q$, i.e., $\alpha_q(X_0)$ is almost constant.

Note that $\alpha_q(X_0)$, $\beta_q(X_0)$ and $\mu_q(\Phi_0)$ are always non-increasing functions of $q$. If $\alpha_q(X_0)$ were actually constant, choosing $q = \infty$ in Eq. (13) would lead to the weakest possible incoherence condition, which would read in terms of the well-known coherence of the dictionary

$$\mu_\infty(\Phi_0) := \max_{k \ne \ell} |\langle \varphi_k, \varphi_\ell \rangle| < \frac{\alpha_\infty(X_0) - \beta_\infty(X_0)}{\gamma(X_0)}.$$

2) Almost not sparse training data: However, the behaviour of $\alpha_q(X_0)$ as a function of $q$ heavily depends on the nature of the training data, which determines the size and shape of the polytopes $\bar{X}_k Q$. Indeed, for Bernoulli-Gaussian data associated to a large $p$ (data with many non-sparse outliers), $\bar{X}_k Q$ seems rather shaped (when $N$ is large) as a Euclidean ball in $\mathbb{R}^{K-1}$, as illustrated in Figure 8. Therefore, for such data we expect that

$$\alpha_q(X_0) \approx \begin{cases} \alpha_2, & q \le 2, \\ \alpha_2 \cdot (K-1)^{-(1/2 - 1/q)}, & q \ge 2. \end{cases}$$

As a result, $q = 2$ is essentially the best choice among $1 \le q \le 2$, but all choices $2 \le q \le \infty$ remain a priori possible, depending on the behaviour of $\beta_q(X_0)$.
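The two regimes can be illustrated numerically in the planar case $K = 3$ (so $K - 1 = 2$). The sketch below is only a grid approximation of the radius $R_q(A) = \inf_{z \ne 0} \|A^* z\|_1 / \|z\|_{q'}$ from Lemma B.6, and the test matrices (the identity for a "cube", evenly spread unit columns for a "disc") are idealised stand-ins chosen for this illustration, not data from the paper.

```python
import numpy as np

def conjugate_exponent(q):
    """Return q' with 1/q + 1/q' = 1, handling q = 1 and q = inf."""
    if q == 1:
        return np.inf
    if np.isinf(q):
        return 1.0
    return q / (q - 1.0)

def inscribed_radius_2d(A, q, n_angles=3600):
    """Grid estimate of R_q(A) for a 2 x M matrix A (planar case only)."""
    qp = conjugate_exponent(q)
    # The ratio is even in z, so directions in [0, pi) are enough.
    theta = np.linspace(0.0, np.pi, n_angles, endpoint=False)
    Z = np.stack([np.cos(theta), np.sin(theta)])      # 2 x n_angles
    num = np.abs(A.T @ Z).sum(axis=0)                 # ||A^* z||_1 per direction
    den = np.linalg.norm(Z, ord=qp, axis=0)           # ||z||_{q'} per direction
    return (num / den).min()

# "Cube": A = I, so AQ is the square [-1, 1]^2 and the radius does not depend on q.
cube_radii = [inscribed_radius_2d(np.eye(2), q) for q in (1, 2, np.inf)]

# "Disc": many evenly spread unit columns, so AQ is a nearly round zonotope.
M = 200
angles = np.pi * np.arange(M) / M
disc = np.stack([np.cos(angles), np.sin(angles)])
ratio = inscribed_radius_2d(disc, np.inf) / inscribed_radius_2d(disc, 2)
```

For the disc-like matrix, `ratio` should be close to $(K-1)^{-1/2} = 2^{-1/2}$, in line with the expected behaviour for $q \ge 2$.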




Fig. 8. Shape of the polytope $\bar{X}_k Q$, $K = 3$, $p = 0.9$ and $N = 2000$. The data was drawn according to the Bernoulli-Gaussian model described in Section V and is almost not sparse. The shape is close to a Euclidean ball. Note the axis coordinates, which indicate that the size of the ball is somewhat smaller than in Figure 7 for the same number of training samples but $p = 0.1$.



V. Probabilistic Analysis 

In this section we will derive how many training signals are typically needed to ensure that a sufficiently incoherent basis constitutes a local minimum of the $\ell_1$-criterion, given that the coefficients of these signals are drawn from a certain probability distribution.

From a Bayesian perspective, it would seem natural to consider the Laplacian distribution: minimising the $\ell_1$-cost function corresponds to maximising the likelihood of $\Phi$ under a Laplacian prior. However, when drawing coefficients from a Laplacian distribution, the probability of observing a zero entry is zero. Therefore, under the Laplacian prior, the minimum of the $\ell_1$-cost function might be close to $\Phi_0$ but cannot be exactly located at $\Phi_0$, no matter how many training samples are drawn. For this reason, we choose to consider coefficients drawn according to a Bernoulli-Gaussian distribution, which ensures a nonzero probability $1 - p > 0$ of observing zero entries. In a sense, the setting we consider is similar to the hypotheses of the first papers on Compressed Sensing and sparse recovery [10], [14], [9], where ill-posed linear inverse problems are solved by $\ell_1$-minimisation under an exact sparsity assumption. The difference here is that the model we consider also allows a certain proportion of non-sparse "outliers" in the training samples, as previously illustrated in Figure 1.

A. The Bernoulli-Gaussian Model 

We assume that the entries $x_{kn}$ of the $K \times N$ coefficient matrix $X_0$ are i.i.d. with $x_{kn} = \xi_{kn} g_{kn}$, where the $\xi_{kn}$ are indicator variables taking the value one with probability $p$ and zero with probability $1 - p$, i.e. $\xi \sim p\,\delta_1 + (1 - p)\,\delta_0$. The variables $g_{kn}$ follow a standard Gaussian distribution, i.e. centered with unit variance.

The important role of the indicator variables is to guarantee a strictly positive probability that the entry $x_{kn}$ is exactly zero. The assumption that the $g_{kn}$ are centered Gaussians with unit variance is made mainly for simplicity, as it allows us to do all proofs using only elementary probability theory. However, we believe that the same results hold for many other distributions as long as they show a certain amount of concentration.
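A minimal sketch of how such a coefficient matrix can be sampled with NumPy is given below. The function name `sample_bernoulli_gaussian` and its `rng` argument are choices made for this illustration only.

```python
import numpy as np

def sample_bernoulli_gaussian(K, N, p, rng=None):
    """Draw a K x N matrix with i.i.d. entries xi * g (Bernoulli(p) times N(0, 1))."""
    rng = np.random.default_rng(rng)
    xi = rng.random((K, N)) < p            # indicator variables, one with probability p
    g = rng.standard_normal((K, N))        # centered Gaussians with unit variance
    return xi * g                          # entries are exactly zero where xi is zero

X0 = sample_bernoulli_gaussian(K=3, N=20000, p=0.3, rng=0)
fraction_nonzero = (X0 != 0).mean()                  # close to p
row_l1 = np.abs(X0).sum(axis=1) / (20000 * 0.3)      # close to sqrt(2 / pi)
```

The normalised row norms `row_l1` concentrate around $\sqrt{2/\pi} \approx 0.80$, the mean of $|g|$ for a standard Gaussian.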



B. Asymptotic Coherence Condition 



From Theorem 4.1 we know that we have to determine $\alpha$, $\beta$ and $\gamma$ so that with high probability

a) for all $k$, the image $\bar{X}_k Q^{|\bar{\Lambda}_k|}$ of the unit cube by the linear map $\bar{X}_k$ contains a large $\ell_q$-ball:

$$\alpha_q(X_0) \ge \alpha,$$

b) for all $k$, the vector $X_k (s^k)^*$ has small $\ell_q$-norm:

$$\beta_q(X_0) \le \beta,$$

c) for all $k$, the $k$-th row $x^k$ has small $\ell_1$-norm:

$$\gamma(X_0) \le \gamma.$$

In Appendices C and D we derive estimates for $\alpha$, $\beta$, $\gamma$ and the associated probabilities using an $\ell_2$-ball, i.e. $q = 2$. Our main tools are concentration of measure results to bound the probability that a random variable deviates significantly from its expected value. We obtain probability bounds exponentially small in $N$ using

$$\alpha \approx N p (1 - p) \sqrt{2/\pi}, \qquad \beta \approx \sqrt{NK}\, p, \qquad \gamma \approx N p \sqrt{2/\pi},$$

yielding, in the asymptotic regime of large $N$, coherence constraints of the type

$$\mu_2(\Phi_0) < 1 - p.$$
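To see how the constraint $\mu_2(\Phi_0) < 1 - p$ emerges, one can plug the approximate values into the right-hand side of Eq. (13): $(\alpha - \beta)/\gamma \approx (1 - p) - \sqrt{\pi K / (2N)}$. The sketch below evaluates this ratio; it uses only the leading-order approximations quoted above (not the exact constants of the appendix), and the function name is illustrative.

```python
import math

def asymptotic_coherence_threshold(N, K, p):
    """Leading-order value of (alpha - beta) / gamma for q = 2."""
    c = math.sqrt(2.0 / math.pi)           # mean of |g| for a standard Gaussian
    alpha = N * p * (1.0 - p) * c
    beta = math.sqrt(N * K) * p
    gamma = N * p * c
    return (alpha - beta) / gamma

# The threshold approaches 1 - p from below as N grows with K fixed.
thresholds = [asymptotic_coherence_threshold(N, K=10, p=0.3) for N in (10**3, 10**5, 10**8)]
```

The correction term $\sqrt{\pi K/(2N)}$ becomes negligible once $N$ is large compared to $K$.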



C. Non-Asymptotic Result - Required Number of Training Samples

More specifically, we wish to quantify which number $N$ of training samples guarantees, with high probability, that a basis is locally identifiable by $\ell_1$-minimisation. The following theorem, whose proof can be found in Appendix E, provides an answer to this question.

Theorem 5.1: Let $X_0$ be a $K \times N$ matrix drawn according to the Bernoulli-Gaussian model described in Section V-A



with parameter $p < 4/5$. Assume that N > 2 (i- P ) 2 and that $\Phi_0$ is an incoherent basis such that

//,(*„) I - 1> \/~. (14)

Then $\Phi_0$ is locally identifiable from $Y := \Phi_0 X_0$ by $\ell_1$-minimisation, except with probability at most

^(fk.^-wi-rtdfizM), (15)

where $0 < \varepsilon < 1/5$ is chosen as large as possible under the constraint

M*o) <(l-p)-(l-5e) 

Note that we only require $p < 4/5$ to give a simple probability bound. Similar estimates also hold for $p \ge 4/5$, see the proof in Appendix E.

In the theorem above, note that we need $N p (1 - p) \varepsilon^2 > K$ to have a failure probability smaller than one in (15). The failure probability will rapidly approach zero as soon as the number of training signals $N$ is larger than a constant times

$$\frac{K \log K}{p (1 - p) \varepsilon^2}.$$

Considering that, in order not to have a trivial sparse solution, where the columns of $\Phi$ are scaled versions of the training samples $y_n$, we need at least $K + 1$ training samples, this is not a large requirement.
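As a rough numerical illustration of this scale, the following sketch evaluates $K \log K / (p(1-p)\varepsilon^2)$ for a few dimensions. The unspecified constant of the theorem is deliberately left out, so the numbers only indicate growth rates, and the function name `sample_scale` is hypothetical.

```python
import math

def sample_scale(K, p, eps):
    """K log K / (p (1 - p) eps^2), up to the unspecified constant factor."""
    return K * math.log(K) / (p * (1.0 - p) * eps ** 2)

# Growth with the dimension K for fixed p and eps: linear up to the log factor.
scales = {K: sample_scale(K, p=0.1, eps=0.1) for K in (16, 64, 256, 1024)}
per_dimension = {K: s / K for K, s in scales.items()}   # grows only like log K
```

The ratio `per_dimension` increases only logarithmically in $K$, which is the sense in which the required number of samples is "linear up to a logarithmic factor".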
Example: consider $\Phi_0$ a basis of $\mathbb{R}^K$ made of $1 \le \ell \le K/2$ (resp. $K - \ell$) vectors from an orthonormal basis $\Phi_1$ (resp. $\Phi_2$), where $\Phi_2$ is maximally incoherent with $\Phi_1$ [10], [14]. It is easy to check that $\mu_2(\Phi_0) = 1 - \ell/K < 1$, hence $\Phi_0$ is, with high probability, a local minimum of the $\ell_1$-criterion with $Y = \Phi_0 X_0$ when $X_0$ is drawn according to the Bernoulli-Gaussian model with $p < \ell/K \le 1/2$.

VI. Discussion 

We have developed necessary and sufficient algebraic conditions on a dictionary coefficient pair to constitute a local minimum of the $\ell_1$-dictionary learning criterion. In case the dictionary is an incoherent basis, we have shown that for coefficient matrices generated from a random sparse model the resulting basis coefficient pair satisfies these conditions with high probability as long as the number of training signals grows like $d \log d$. These are exciting new results, but since dictionary learning is a relatively young field they lead to more open questions.

For the special case when the dictionary is assumed to be a basis, a helpful result for practical purposes would be to prove that under the random model there exists only one local minimum, which then has to be the global one and could be found with simple descent algorithms. Numerical experiments in two dimensions support this hypothesis, as shown in Figure 2, where the only two local minima are at the original dictionary $\Phi_0$ and at the dictionary corresponding to $\Phi_0$ with permuted columns.

It would also be desirable to show the converse direction, i.e. if the coherence of the basis is too high and the training signals are generated by the same random sparse model, the basis coefficient pair will not be a local minimum. Again, this is empirically the case, as shown in Figure 3. To answer this question from a theoretical perspective, it will first be necessary to investigate for which $q$ the $\ell_q$-ball most resembles the image of the unit cube under $\bar{X}_k$. In the proof here we used $q = 2$, but there are some indications that $q = \infty$ is the more appropriate choice, which could also lead to a sharper version of the current result. Ideally we could then show that, as soon as a basis has coherence $\max_k \|\bar{m}_k\|_q$ higher than $(1 - p)$, it is extremely unlikely to be a local minimum.

Finally, much harder research will have to be invested to extend the current results to the overcomplete and the noisy case. In the overcomplete case, the null space has to be taken into account, which prevents a straightforward generalisation from the intrinsic necessary and sufficient conditions of Lemma B.3 to explicit sufficient conditions as in Theorem 3.2. In the noisy case, even the formulation of the problem has to be changed, as we cannot expect the best dictionary for the noise-contaminated training data to be exactly the same as the original dictionary but only close to it.

Appendix A 
Notations 

To state the main lemmata we need to introduce the following notation conventions. 



Frobenius norm and inner product.

For any matrix $A$, $A^*$ denotes the transpose of $A$. We let $\langle A, B \rangle_F = \mathrm{trace}(A^* B)$ denote the natural inner product between matrices, which is associated to the Frobenius norm $\|A\|_F^2 = \langle A, A \rangle_F$, and $\mathrm{sign}(A)$ is the sign operator applied componentwise to the matrix $A$ (by convention $\mathrm{sign}(0) := 0$). All proofs will rely extensively on the fact that

$$\langle AB, C \rangle_F = \mathrm{trace}(B^* A^* C) = \mathrm{trace}(A^* C B^*) = \langle A, C B^* \rangle_F \tag{17}$$

and similar relations such as

$$\langle \mathrm{diag}(A), B \rangle_F = \langle A, \mathrm{diag}(B) \rangle_F. \tag{18}$$


Zero-diagonal & diagonal decomposition.

We will use the following simple lemma.

Lemma A.1: Consider $A$, $B$ two matrices and let $A = Z_1 + \Delta_1$, $B = Z_2 + \Delta_2$ be their unique decomposition into a sum of a zero-diagonal and a diagonal matrix. Then

$$\mathrm{diag}(AB) = \Delta_1 \Delta_2 + \mathrm{diag}(Z_1 Z_2).$$

Proof: The product of a zero-diagonal matrix with a diagonal matrix is zero-diagonal, hence $Z_1 \Delta_2$ and $\Delta_1 Z_2$ are zero-diagonal and

$$\begin{aligned} \mathrm{diag}(AB) &= \mathrm{diag}\big((Z_1 + \Delta_1)(Z_2 + \Delta_2)\big) \\ &= \mathrm{diag}(Z_1 Z_2 + \Delta_1 Z_2 + Z_1 \Delta_2 + \Delta_1 \Delta_2) \\ &= \mathrm{diag}(Z_1 Z_2) + \Delta_1 \Delta_2. \end{aligned}$$

□

For any dictionary $\Phi_0 \in \mathcal{D}$ we will consider in particular the decomposition of the Gram matrix $\Phi_0^* \Phi_0$ into a diagonal part and a zero-diagonal part:

$$\Delta_0 := \mathrm{diag}(\Phi_0^* \Phi_0) = \mathrm{diag}(\|\varphi_k\|_2^2) = I, \tag{19}$$

$$M_0 := \Phi_0^* \Phi_0 - I. \tag{20}$$
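A quick numerical check of Lemma A.1 and of the decomposition (19)-(20) can be written as follows. This is a sketch on random matrices; the helper `split_diag` is a name introduced for this example only.

```python
import numpy as np

def split_diag(A):
    """Return (Z, Delta): the zero-diagonal and diagonal parts of a square matrix."""
    Delta = np.diag(np.diag(A))
    return A - Delta, Delta

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))
Z1, D1 = split_diag(A)
Z2, D2 = split_diag(B)

lhs = np.diag(np.diag(A @ B))                     # diag(AB)
rhs = D1 @ D2 + np.diag(np.diag(Z1 @ Z2))         # Delta_1 Delta_2 + diag(Z_1 Z_2)

# Gram matrix of a dictionary with unit-norm columns: diagonal part I, remainder M0.
Phi0 = rng.standard_normal((5, 4))
Phi0 /= np.linalg.norm(Phi0, axis=0)
M0, Delta0 = split_diag(Phi0.T @ Phi0)
```

Here `Delta0` is the identity because the columns of `Phi0` were normalised, and `M0` collects the pairwise inner products between distinct atoms.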

Null space

We denote by $\mathcal{N}(\Phi)$ the null space of the dictionary $\Phi$, i.e. the linear subspace made up of all column vectors $v \in \mathbb{R}^K$ such that $\Phi v = 0$. By abuse of notation, we will also denote by $\mathcal{N}(\Phi)$ the linear space of $K \times N$ matrices $V$ such that $\Phi V = 0$.

$\varepsilon$-cover

A finite $\varepsilon$-cover of the unit $\ell_q$-sphere in $\mathbb{R}^n$ is a finite set $\mathcal{X}$ of points with unit $\ell_q$-norm such that for all points $x$ in the sphere, i.e. $\|x\|_q = 1$, we have

$$\min_{x_i \in \mathcal{X}} \|x - x_i\|_q \le \varepsilon.$$

From Lemma 4.10 in [21] we know that for $\varepsilon \in (0, 1)$ there always exists an $\varepsilon$-cover $\mathcal{X}$ with cardinality $|\mathcal{X}| \le (3/\varepsilon)^n$.

Appendix B 
Tangent spaces and local minima 

To characterise whether $(\Phi_0, X_0)$ is a local minimum of (P1'), we will use the notion of the tangent space $T_{(\Phi_0, X_0)}\mathcal{M}(Y)$ to the constraint manifold

$$\mathcal{M}(Y) := \{(\Phi, X),\ \Phi \in \mathcal{D},\ \Phi X = Y\} \tag{21}$$

at the point $(\Phi_0, X_0)$. We characterise this tangent space before providing the characterisation of the local minima.

A. The Tangent Space $T_{(\Phi_0, X_0)}\mathcal{M}(Y)$

The tangent space $T_{(\Phi_0, X_0)}\mathcal{M}(Y)$ to the constraint manifold $\mathcal{M}(Y)$ at the point $(\Phi_0, X_0)$ is the collection of the derivatives $(\Phi', X') := (\Phi'(0), X'(0))$ of all smooth functions $\varepsilon \mapsto (\Phi(\varepsilon), X(\varepsilon))$ which satisfy $\forall \varepsilon$, $(\Phi(\varepsilon), X(\varepsilon)) \in \mathcal{M}(Y)$ and $(\Phi(0), X(0)) = (\Phi_0, X_0)$.

Below we characterise the tangent spaces $T_{\Phi_0}\mathcal{D}$ and $T_{(\Phi_0, X_0)}\mathcal{M}(Y)$. The characterisations use the decomposition $\Phi_0^* \Phi_0 = I + M_0$ introduced in Equations (19)-(20), through the notion of admissible matrices: a square $K \times K$ matrix $C$ is said to be admissible if $\Phi' := \Phi_0 \cdot C \in T_{\Phi_0}\mathcal{D}$.

Lemma B.1: Let $\Phi_0 \in \mathcal{D}$ be a complete dictionary.

• Any matrix $\Phi' \in T_{\Phi_0}\mathcal{D}$ can be written as $\Phi' = \Phi_0 \cdot C$ for some admissible $C$.

• The matrix $C$ is admissible if, and only if, there exists a zero-diagonal matrix $Z$ such that

$$C = Z - \mathrm{diag}(M_0 Z). \tag{22}$$

Proof: The first claim is a trivial consequence of the completeness of $\Phi_0$, which shows that any matrix can be written as $\Phi_0 \cdot C$, and the definition of an admissible matrix.

The constraint $\Phi \in \mathcal{D}$ can be rewritten as $\mathrm{diag}(\Phi^* \Phi) = I$. Taking the derivative, it follows that $\Phi' \in T_{\Phi_0}\mathcal{D}$ if, and only if, $\mathrm{diag}(\Phi_0^* \Phi') = 0$. Writing $\Phi' = \Phi_0 \cdot C$ and decomposing $C = Z + \Delta$ into a zero-diagonal and a diagonal matrix, we obtain from Lemma A.1

$$\mathrm{diag}(\Phi_0^* \Phi') = \mathrm{diag}(\Phi_0^* \Phi_0 \cdot C) = \mathrm{diag}\big((M_0 + I)(Z + \Delta)\big) = \Delta + \mathrm{diag}(M_0 Z).$$

Hence $\Phi_0 \cdot C \in T_{\Phi_0}\mathcal{D}$ if, and only if, $\Delta = -\mathrm{diag}(M_0 Z)$, i.e. if $C = Z - \mathrm{diag}(M_0 Z)$.

□
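The characterisation (22) can be checked numerically: for a random square dictionary with unit-norm columns and a random zero-diagonal $Z$, the direction $\Phi_0 \cdot C$ should leave the column norms unchanged to first order. The sketch below assumes real matrices and a square, almost surely invertible $\Phi_0$; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4
Phi0 = rng.standard_normal((K, K))
Phi0 /= np.linalg.norm(Phi0, axis=0)          # unit-norm columns
M0 = Phi0.T @ Phi0 - np.eye(K)                # zero-diagonal part of the Gram matrix

Z = rng.standard_normal((K, K))
np.fill_diagonal(Z, 0.0)                      # arbitrary zero-diagonal matrix
C = Z - np.diag(np.diag(M0 @ Z))              # admissible matrix, Eq. (22)
dPhi = Phi0 @ C                               # candidate tangent direction

tangency = np.diag(Phi0.T @ dPhi)             # diag(Phi0^* Phi') should vanish
t = 1e-4
norm_drift = np.linalg.norm(Phi0 + t * dPhi, axis=0) ** 2 - 1.0   # expected O(t^2)
```

The vector `tangency` vanishes up to rounding, and `norm_drift` is of order $t^2$ rather than $t$, as expected for a direction tangent to the unit-norm constraint.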

Lemma B.2: The pair $(\Phi', X')$ is in the tangent space $T_{(\Phi_0, X_0)}\mathcal{M}(Y)$ if, and only if, there exists an admissible matrix $C$ and an element $V$ of $\mathcal{N}(\Phi_0)$ such that

$$\Phi' = \Phi_0 \cdot C, \tag{23}$$

$$X' = -C \cdot X_0 + V. \tag{24}$$

Proof: Given the nature of the constraint manifold $\mathcal{M}(Y)$, its tangent space at $(\Phi_0, X_0)$ is made up of all the pairs $(\Phi', X')$ such that $\Phi' \in T_{\Phi_0}\mathcal{D}$ and $\Phi' X_0 + \Phi_0 X' = 0$, meaning $\Phi' = \Phi_0 \cdot C$ with some admissible $C$, and $\Phi_0 (C X_0 + X') = 0$. The latter is equivalent to $C X_0 + X' \in \mathcal{N}(\Phi_0)$. □

B. Characterisation of Local Minima 

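Lemma B.3: Consider a complete dictionary $\Phi_0 \in \mathcal{D}$ and a coefficient matrix $X_0$ such that $\Phi_0 X_0 = Y$. Define the $K \times K$ matrix

$$U := \mathrm{sign}(X_0) X_0^* - M_0^* \,\mathrm{diag}(\|x^k\|_1). \tag{25}$$

a) If for every zero-diagonal $Z$ and $V \in \mathcal{N}(\Phi_0)$ such that $Z X_0 + V \ne 0$ we have

$$|\langle Z, U \rangle_F + \langle V, \mathrm{sign}(X_0) \rangle_F| < \|(Z X_0 + V)_{\bar{\Lambda}}\|_1, \tag{26}$$

then $(\Phi_0, X_0)$ is a strict local minimum of (P1').

b) If the reversed strict inequality holds in (26) for some zero-diagonal $Z$ and some $V \in \mathcal{N}(\Phi_0)$ such that $Z X_0 + V \ne 0$, then $(\Phi_0, X_0)$ is not a local minimum of (P1').

Proof: Denote $a(\varepsilon) \simeq b(\varepsilon)$ when $\lim_{\varepsilon \to 0} \|a(\varepsilon) - b(\varepsilon)\| / |\varepsilon| = 0$. Consider any smooth function $\varepsilon \mapsto (\Phi(\varepsilon), X(\varepsilon)) \in \mathcal{M}(Y)$. By definition we have $X(\varepsilon) \simeq X_0 + \varepsilon X'$, and for small $\varepsilon$, the sign of $X(\varepsilon)$ matches that of $X_0 = X(0)$ on the support $\Lambda$ of $X_0$, hence we may write

$$\begin{aligned} \|X\|_1 &= \langle X, \mathrm{sign}(X) \rangle_F \\ &= \|(X - X_0)_{\bar{\Lambda}}\|_1 + \langle X, \mathrm{sign}(X_0) \rangle_F \\ &= \|(X - X_0)_{\bar{\Lambda}}\|_1 + \langle X - X_0, \mathrm{sign}(X_0) \rangle_F + \|X_0\|_1, \\ \|X\|_1 - \|X_0\|_1 &= \|(X - X_0)_{\bar{\Lambda}}\|_1 + \langle X - X_0, \mathrm{sign}(X_0) \rangle_F \\ &\simeq |\varepsilon| \cdot \|(X')_{\bar{\Lambda}}\|_1 + \varepsilon \langle X', \mathrm{sign}(X_0) \rangle_F. \end{aligned}$$

As a result, the one-sided derivatives of the $\ell_1$-criterion in the tangent direction $(\Phi', X')$ are

$$\nabla^{+}_{(\Phi', X')} \|X\|_1 := \lim_{\varepsilon \to 0,\ \varepsilon > 0} \frac{\|X(\varepsilon)\|_1 - \|X_0\|_1}{\varepsilon} = +\|(X')_{\bar{\Lambda}}\|_1 + \langle X', \mathrm{sign}(X_0) \rangle_F,$$

$$\nabla^{-}_{(\Phi', X')} \|X\|_1 := \lim_{\varepsilon \to 0,\ \varepsilon < 0} \frac{\|X(\varepsilon)\|_1 - \|X_0\|_1}{\varepsilon} = -\|(X')_{\bar{\Lambda}}\|_1 + \langle X', \mathrm{sign}(X_0) \rangle_F,$$

and the $\ell_1$-criterion admits a local minimum at $(\Phi_0, X_0)$ if for all $(\Phi', X')$ in the tangent space $T_{(\Phi_0, X_0)}\mathcal{M}(Y)$ with $X' \ne 0$ we have

$$|\langle X', \mathrm{sign}(X_0) \rangle_F| < \|(X')_{\bar{\Lambda}}\|_1.$$

Vice-versa, the $\ell_1$-criterion does not admit a local minimum at $(\Phi_0, X_0)$ if there exists some $(\Phi', X')$ in the tangent space $T_{(\Phi_0, X_0)}\mathcal{M}(Y)$ yielding the reversed strict inequality.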



Using Lemma B.2 we get that the $\ell_1$-criterion admits a local minimum at $(\Phi_0, X_0)$ if for all admissible $C$ and all $V \in \mathcal{N}(\Phi_0)$ such that $V \ne C X_0$ we have

$$|\langle C X_0 + V, \mathrm{sign}(X_0) \rangle_F| < \|(C X_0 + V)_{\bar{\Lambda}}\|_1. \tag{27}$$

The rest of the proof consists in rewriting (27) using Lemma B.1 and the properties (17) and (18). First, using (17), the inequality in (27) is equivalent to

$$|\langle C, \mathrm{sign}(X_0) X_0^* \rangle_F + \langle V, \mathrm{sign}(X_0) \rangle_F| < \|(C X_0 + V)_{\bar{\Lambda}}\|_1.$$

Second, by Lemma B.1 the admissible matrices are exactly the matrices $C = Z - \mathrm{diag}(M_0 Z)$, with $Z$ an arbitrary zero-diagonal matrix. Since $(\Delta \cdot X_0)_{\bar{\Lambda}} = 0$ for any diagonal matrix $\Delta$, we get $(C X_0)_{\bar{\Lambda}} = (Z X_0)_{\bar{\Lambda}}$ for any admissible matrix. The inequality is therefore equivalent to

$$|\langle Z - \mathrm{diag}(M_0 Z), \mathrm{sign}(X_0) X_0^* \rangle_F + \langle V, \mathrm{sign}(X_0) \rangle_F| < \|(Z X_0 + V)_{\bar{\Lambda}}\|_1, \tag{28}$$

with arbitrary zero-diagonal $Z$ and $V \in \mathcal{N}(\Phi_0)$.

Third, since $\mathrm{diag}(\mathrm{sign}(X_0) X_0^*) = \mathrm{diag}(\|x^k\|_1)$, we observe using (17) and (18) that

$$\langle \mathrm{diag}(M_0 Z), \mathrm{sign}(X_0) X_0^* \rangle_F = \langle M_0 Z, \mathrm{diag}(\mathrm{sign}(X_0) X_0^*) \rangle_F = \langle Z, M_0^* \,\mathrm{diag}(\|x^k\|_1) \rangle_F. \tag{29}$$

Hence the inequality in (28) is equivalent to

$$|\langle Z, \mathrm{sign}(X_0) X_0^* - M_0^* \,\mathrm{diag}(\|x^k\|_1) \rangle_F + \langle V, \mathrm{sign}(X_0) \rangle_F| < \|(Z X_0 + V)_{\bar{\Lambda}}\|_1.$$

□



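C. Proof of Theorems 3.1 and 3.2

Lemma B.4: Using the notations of Section III we have

$$\sup_{Z \ne 0} \frac{|\langle Z, U \rangle_F|}{\|(Z X_0)_{\bar{\Lambda}}\|_1} = \max_k \sup_{z \in \mathbb{R}^{K-1} \setminus \{0\}} \frac{|\langle u_k, z \rangle|}{\|\bar{X}_k^* z\|_1}. \tag{30}$$

Proof: Denote $z^k$ the $k$-th row of the zero-diagonal matrix $Z$: it is a row vector in $\mathbb{R}^K$ with a zero entry at the $k$-th coordinate, and we denote $\bar{z}^k$ the row vector in $\mathbb{R}^{K-1}$ obtained by removing this zero entry. Observe that the $k$-th row of $Z X_0$ is $z^k X_0 = \bar{z}^k X_0^{(k)}$, where $X_0^{(k)}$ is $X_0$ with the $k$-th row removed. As a consequence the denominator in Eq. (30) is decomposed into the sum

$$\|(Z X_0)_{\bar{\Lambda}}\|_1 = \sum_k \|(z^k X_0)_{\bar{\Lambda}_k}\|_1 = \sum_k \|(\bar{z}^k X_0^{(k)})_{\bar{\Lambda}_k}\|_1 = \sum_k \|\bar{z}^k \bar{X}_k\|_1 = \sum_k \|\bar{X}_k^* (\bar{z}^k)^*\|_1. \tag{31}$$

Now we decompose the numerator into a similar sum. First, we observe that

$$\langle Z, M_0^* \,\mathrm{diag}(\|x^\ell\|_1) \rangle_F = \sum_k \langle z^k, m_k^* \,\mathrm{diag}(\|x^\ell\|_1)_{1 \le \ell \le K} \rangle = \sum_k \langle \bar{z}^k, \bar{m}_k^* \,\mathrm{diag}(\|x^\ell\|_1)_{1 \le \ell \le K,\ \ell \ne k} \rangle,$$

$$\langle Z, \mathrm{sign}(X_0) X_0^* \rangle_F = \langle Z X_0, \mathrm{sign}(X_0) \rangle_F = \sum_k \langle z^k X_0, \mathrm{sign}(x^k) \rangle = \sum_k \langle \bar{z}^k X_0^{(k)}, \mathrm{sign}(x^k) \rangle.$$

Then, by matching column permutations of $X_0^{(k)}$ and $\mathrm{sign}(x^k)$ we get

$$\langle \bar{z}^k X_0^{(k)}, \mathrm{sign}(x^k) \rangle = \langle \bar{z}^k [X_k; \bar{X}_k], [s^k; 0] \rangle = \langle \bar{z}^k X_k, s^k \rangle$$

and conclude that the numerator is

$$|\langle Z, U \rangle_F| = \Big| \sum_k \langle \bar{z}^k, u_k^* \rangle \Big|. \tag{32}$$

The conclusion is then straightforward. □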



Proof of Theorem 3.1: Using Lemma B.3 and Remark 3.1 we know that if $\Phi_0$ is a local minimum of (P1') or a global minimum of (5), then for any zero-diagonal matrix $Z$ and any $V \in \mathcal{N}(\Phi_0)$ such that $Z X_0 + V \ne 0$ we have $|\langle Z, U \rangle_F + \langle V, \mathrm{sign}(X_0) \rangle_F| \le \|(Z X_0 + V)_{\bar{\Lambda}}\|_1$. In particular, for any $Z \ne 0$ and $V = 0$, we have $|\langle Z, U \rangle_F| \le \|(Z X_0)_{\bar{\Lambda}}\|_1$. We conclude using Lemma B.4. □

Proof of Theorem 3.2: When $\Phi_0$ is a basis, the null space is $\mathcal{N}(\Phi_0) = \{0\}$, and Condition (26) is satisfied for all zero-diagonal matrices $Z$ and $V \in \mathcal{N}(\Phi_0)$ such that $Z X_0 + V \ne 0$ if, and only if, for all nonzero zero-diagonal matrices $Z$ we have $|\langle Z, U \rangle_F| < \|(Z X_0)_{\bar{\Lambda}}\|_1$. Again, we conclude thanks to Lemma B.4. □



D. Duality Analysis

The next lemma exploits duality to understand the geometric meaning of conditions (NC)-(SC). It is used with the matrix $A = \bar{X}_k$ to obtain the equivalent characterisation (8) used in Section IV.

Lemma B.5: Let $A$ be an $n \times M$ matrix with rank $n$. For any vector $v$ define

$$\|v\|_A := \sup_{z \ne 0} \frac{\langle v, z \rangle}{\|A^* z\|_1}. \tag{33}$$

We have the equivalent characterisation

$$\|v\|_A = \min \|d\|_\infty, \quad \text{under the constraint } A d = v. \tag{34}$$

Proof: We will just prove that

$$\|v\|_A \le \min \|d\|_\infty, \quad \text{under the constraint } A d = v.$$

The reversed inequality is more technical but only requires casting both norm characterisations (33)-(34) to a pair of linear programs in primal and dual form, and using the strong duality theorem to show that both programs, which are bounded and feasible, have the same optimal value. To check the easy inequality, take any $d$ such that $A d = v$. Since $A$ has rank $n$, we have $\|A^* z\|_1 \ne 0$ whenever $z \ne 0$. Thus, for any $z \ne 0$ we have

$$\langle v, z \rangle = \langle A d, z \rangle = \langle d, A^* z \rangle \le \|d\|_\infty \cdot \|A^* z\|_1,$$

hence $\|v\|_A \le \|d\|_\infty$. □
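The easy inequality can be observed numerically. The sketch below draws a random full-rank $A$ and a feasible $d$, then samples many directions $z$; the largest sampled ratio is only a lower estimate of $\|v\|_A$ (the exact value would require solving the linear program, which is not attempted here), but it must stay below $\|d\|_\infty$. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, M = 3, 8
A = rng.standard_normal((n, M))            # full rank n with probability one
d = rng.uniform(-1.0, 1.0, size=M)         # a point of the unit cube Q^M
v = A @ d                                  # hence v lies in A Q^M

Zs = rng.standard_normal((n, 10000))       # random test directions z (as columns)
ratios = (v @ Zs) / np.abs(A.T @ Zs).sum(axis=0)   # <v, z> / ||A^* z||_1
lower_estimate = ratios.max()              # sampled lower estimate of ||v||_A
upper_bound = np.abs(d).max()              # ||d||_inf for this feasible d
```

Since $d$ was drawn in the unit cube, `upper_bound` is at most one, which is the statement $v \in A Q^M \Rightarrow \|v\|_A \le 1$.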

Lemma B.6: Consider $A$ an $n \times M$ matrix and $1 \le q, q' \le \infty$ with $1/q + 1/q' = 1$. The radius of the largest $\ell_q$ ball included in $A Q^M$ is

$$R_q(A) := \inf_{z \ne 0} \frac{\|A^* z\|_1}{\|z\|_{q'}}. \tag{35}$$

Proof: If $A$ is not of rank $n$ we let the reader check that $R_q(A) = 0$ is also the radius of the largest ball included in $A Q^M$. Otherwise, from Lemma B.5 we know that $v \in A Q^M$ if and only if $\sup_{z \ne 0} |\langle v, z \rangle| / \|A^* z\|_1 \le 1$. The inclusion of an $\ell_q$ ball of radius $\alpha$ in $A Q^M$ is therefore equivalent to

$$\sup_{\|v\|_q \le \alpha}\ \sup_{z \ne 0} \frac{|\langle v, z \rangle|}{\|A^* z\|_1} \le 1.$$

Conclude by rewriting the left hand side:

$$\alpha \cdot \sup_{z \ne 0}\ \sup_{\|v\|_q \le 1} \frac{|\langle v, z \rangle|}{\|A^* z\|_1} = \alpha \cdot \sup_{z \ne 0} \frac{\|z\|_{q'}}{\|A^* z\|_1}.$$

□



E. Proof of Theorem 4.1 



Using the definition of $u_k$, $\beta_q(X_0)$, $\gamma(X_0)$ and $\mu_q(\Phi_0)$ (cf. Eqs. (7), (11), (12) and (13)) and the assumption on $\mu_q(\Phi_0)$ (Eq. (13)) we have for all $k$

$$\|u_k\|_q \le \|v_k\|_q + \gamma(X_0) \cdot \mu_q(\Phi_0) \le \beta_q(X_0) + \gamma(X_0) \cdot \mu_q(\Phi_0) < \alpha_q(X_0).$$

Hence, by definition of $\alpha_q(X_0)$ the vector $u_k$ belongs to $\bar{X}_k Q$ for all $k$, and we conclude using Lemma B.5 that the condition (SC) is satisfied. In particular, if $\Phi_0$ is a basis then we conclude using Theorem 3.2 that $(\Phi_0, X_0)$ is a local minimum of (P1').



Appendix C 
Probability estimates 

A. Typical Size of $\|x^k\|_1$

The typical size of $\gamma(X_0) = \max_k \|x^k\|_1$ can be directly derived from the following concentration of measure result.

Theorem C.1: Let $x$ be a vector of length $N$, whose entries follow the distribution described in Subsection V-A, $x_n = \xi_n g_n$, $n = 1 \ldots N$. Then for any $\varepsilon > 0$

$$\mathbb{P}\big(\|x\|_1 > N p (\sqrt{2/\pi} + \varepsilon)\big) \le 2 \exp\Big(-\frac{N p\, \varepsilon^2}{2 + \sqrt{2}\,\varepsilon}\Big).$$

It follows immediately, using a union bound, that with

$$\gamma := N p (\sqrt{2/\pi} + \varepsilon), \tag{36}$$

we have

$$\mathbb{P}(\gamma(X_0) > \gamma) \le 2K \cdot \exp\Big(-\frac{N p\, \varepsilon^2}{2 + \sqrt{2}\,\varepsilon}\Big). \tag{37}$$

B. General Approach to Estimating $\alpha$ and $\beta$

Now we will estimate the probability that for one index $k$ either a) or b) fails. Denote $\Omega_k$ the event

$$\Omega_k := \{R_q(\bar{X}_k) < \alpha\} \cup \{\|X_k (s^k)^*\|_q > \beta\},$$

i.e. either a) or b) fails for row $k$. Then $\Omega = \cup_k \Omega_k$ is the undesired event $\{\alpha_q(X_0) < \alpha\} \cup \{\beta_q(X_0) > \beta\}$. Using a union bound over the row indices $k$ and conditioning on the size of the set of zero entries $|\bar{\Lambda}_k|$ we get

$$\begin{aligned} \mathbb{P}(\Omega) &\le \sum_{k, M} \mathbb{P}(\Omega_k \mid |\bar{\Lambda}_k| = M) \cdot \mathbb{P}(|\bar{\Lambda}_k| = M) \\ &\le K \cdot \max_{M \in [M_l, M_u]} \mathbb{P}(\Omega_k \mid |\bar{\Lambda}_k| = M) + K \cdot \mathbb{P}(|\bar{\Lambda}_k| \notin [M_l, M_u]). \end{aligned} \tag{38}$$

We start with the estimate of the second term in the sum above, the probability of the number of zero coefficients in a given row being below $M_l$ or above $M_u$.

Lemma C.2: Consider $0 < \varepsilon < 1$. Setting $M_l = N(1 - p)(1 - \varepsilon)$ and $M_u = N(1 - p)(1 + \varepsilon)$ we get that

$$\mathbb{P}(|\bar{\Lambda}_k| \notin [M_l, M_u]) \le 2 \exp(-2N(1 - p)^2 \varepsilon^2). \tag{39}$$

We will estimate the first term in (38) by splitting it into two terms that we will estimate separately:

$$\mathbb{P}(\Omega_k \mid |\bar{\Lambda}_k| = M) \le \mathbb{P}(R_q(\bar{X}_k) < \alpha \mid |\bar{\Lambda}_k| = M) + \mathbb{P}(\|X_k (s^k)^*\|_q > \beta \mid |\bar{\Lambda}_k| = M). \tag{40}$$
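The bound (39) of Lemma C.2 can be compared with a simple Monte Carlo estimate. The sketch below simulates the number of zero entries in a row as a Binomial$(N, 1-p)$ variable, which is what the Bernoulli-Gaussian model implies; the parameter values and function names are arbitrary choices for illustration.

```python
import numpy as np

def zero_count_bound(N, p, eps):
    """Right-hand side of (39): 2 exp(-2 N (1 - p)^2 eps^2)."""
    return 2.0 * np.exp(-2.0 * N * (1.0 - p) ** 2 * eps ** 2)

def empirical_zero_count_tail(N, p, eps, trials=100_000, seed=0):
    """Monte Carlo frequency of the zero count falling outside [M_l, M_u]."""
    rng = np.random.default_rng(seed)
    zeros = rng.binomial(N, 1.0 - p, size=trials)     # number of zeros per simulated row
    m_low = N * (1.0 - p) * (1.0 - eps)
    m_up = N * (1.0 - p) * (1.0 + eps)
    return np.mean((zeros < m_low) | (zeros > m_up))

bound = zero_count_bound(N=500, p=0.5, eps=0.1)
frequency = empirical_zero_count_tail(N=500, p=0.5, eps=0.1)
```

As is typical for Hoeffding-type bounds, the empirical frequency is noticeably smaller than the bound, which is not tight but has the right exponential dependence on $N$.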

C. Typical Size of $\alpha_q(X_0)$

Now we estimate the typical size of the largest $\ell_q$ ball we can inscribe into the image of the unit cube $Q^{|\bar{\Lambda}_k|}$ by $\bar{X}_k$ when $|\bar{\Lambda}_k| = M$. For simplicity we write $L$ for $K - 1$, and we denote $A = \bar{X}_k$. From Lemma B.6 we know that we need to estimate the value of $\|A^* z\|_1$ and compare it to $\|z\|_{q'}$. We begin with some geometrical observations.

Lemma C.3: Let $\mathcal{X} = \{z_i\}$ be a finite $\varepsilon_{\mathcal{X}}$-cover for the unit $\ell_{q'}$ sphere in $\mathbb{R}^L$. Assume that we have both the lower bound

$$\|A^* z_i\|_1 \ge \alpha, \quad \forall z_i \in \mathcal{X},$$

and the upper bound

$$\|A^*\|_{q' \to 1} = \sup_{\|z\|_{q'} \le 1} \|A^* z\|_1 \le \delta.$$

Then $R_q(A) \ge \alpha - \delta \varepsilon_{\mathcal{X}}$.

Proof: By Lemma B.6 we only need to show that for all $z$ with unit $\ell_{q'}$ norm we have $\|A^* z\|_1 \ge \alpha - \delta \varepsilon_{\mathcal{X}}$. By definition of an $\varepsilon_{\mathcal{X}}$-cover, for all $z$ with unit $\ell_{q'}$ norm we can find $z_i \in \mathcal{X}$ with $\|z - z_i\|_{q'} \le \varepsilon_{\mathcal{X}}$. We then have

$$\|A^* z\|_1 \ge \|A^* z_i\|_1 - \|A^*\|_{q' \to 1} \cdot \|z - z_i\|_{q'} \ge \alpha - \delta \varepsilon_{\mathcal{X}}.$$

□



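We will therefore estimate a (typical) lower bound for the norm $\|A^* z_i\|_1$, and an upper bound on the operator norm $\|A^*\|_{q' \to 1}$. We specialise to the case $q = 2$, but other bounds could be derived for other values of $q$.

Lemma C.4: Let $A = (A_1 \ldots A_M)$ be a random matrix of size $L \times M$, whose entries follow the distribution described in Subsection V-A, $A_{ij} = \xi_{ij} g_{ij}$, $i = 1 \ldots L$, $j = 1 \ldots M$. Let $z \in \mathbb{R}^L$ be a vector with $\|z\|_2 = 1$. We have the concentration bounds, for $\varepsilon > 0$,

$$\mathbb{P}\big(\|A^*\|_{2 \to 1} > M \sqrt{pL}\,(1 + \varepsilon)\big) \le 2 \exp\Big(-\frac{M p \cdot \varepsilon^2}{2 + \sqrt{2}\,\varepsilon}\Big),$$

$$\mathbb{P}\big(\|A^* z\|_1 < M p (\sqrt{2/\pi} - \varepsilon)\big) \le 2 \exp\Big(-\frac{M p \cdot \varepsilon^2}{2 + \sqrt{2}\,\varepsilon}\Big). \tag{41}$$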

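Combining the above estimates we obtain

Corollary C.5: Let $0 < \varepsilon < 1$ and define

$$\alpha := N p (1 - p)(1 - \varepsilon)\big(\sqrt{2/\pi} - 2\varepsilon - \varepsilon^2\big). \tag{42}$$

Then, for all $M \in [M_l, M_u]$ we have

$$\mathbb{P}\big(R_2(\bar{X}_k) < \alpha \mid |\bar{\Lambda}_k| = M\big) \le 2 \cdot \Big[\Big(\frac{3}{\varepsilon}\sqrt{L/p}\Big)^{L} + 1\Big] \cdot \exp\Big(-\frac{N p (1 - p)(1 - \varepsilon)\varepsilon^2}{2 + \sqrt{2}\,\varepsilon}\Big). \tag{43}$$

Proof: Given $\varepsilon_{\mathcal{X}} \in (0, 1)$, we can choose an $\varepsilon_{\mathcal{X}}$-cover $\mathcal{X} = \{z_i\}$ for the unit $\ell_2$ sphere in $\mathbb{R}^L$ with $|\mathcal{X}| \le (3/\varepsilon_{\mathcal{X}})^L$. For a random $L \times M$ matrix $A = (A_1 \ldots A_M)$ distributed as in Lemma C.4 we have, combining Lemma C.3 with Lemma C.4 and using $\alpha := M p (\sqrt{2/\pi} - \varepsilon)$ and $\delta := M \sqrt{pL}\,(1 + \varepsilon)$,

$$\begin{aligned} \mathbb{P}(R_2(A) < \alpha - \delta \varepsilon_{\mathcal{X}}) &\le \mathbb{P}(\exists z_i \in \mathcal{X}: \|A^* z_i\|_1 < \alpha) + \mathbb{P}(\|A^*\|_{2 \to 1} > \delta) \\ &\le \big[(3/\varepsilon_{\mathcal{X}})^L + 1\big] \cdot 2 \exp\Big(-\frac{M p \cdot \varepsilon^2}{2 + \sqrt{2}\,\varepsilon}\Big). \end{aligned}$$

Setting $\varepsilon_{\mathcal{X}} = \varepsilon \sqrt{p/L}$ yields $\alpha - \delta \varepsilon_{\mathcal{X}} = M p (\sqrt{2/\pi} - 2\varepsilon - \varepsilon^2)$. According to the probability split in (38), we need to find the maximum of the above expression for $M \in [M_l, M_u]$, which is achieved at $M = M_l = N(1 - p)(1 - \varepsilon)$.

□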



D. Typical Size of $\|X_k (s^k)^*\|_q$

We now estimate the size of $\|X_k (s^k)^*\|_q$. We need the following theorem.

Theorem C.6: Let $B$ be a random matrix of size $L \times n$, whose entries follow the distribution described in Subsection V-A, $B_{ij} = \xi_{ij} g_{ij}$, $i = 1 \ldots L$, $j = 1 \ldots n$, and $s$ be a vector of length $n$ with entries $s_j = \pm 1$, $j = 1 \ldots n$. Then for $\varepsilon' > 0$

$$\mathbb{P}\big(\|Bs\|_2^2 > L n p (1 + \varepsilon')\big) \le 2 \exp\Big(-\frac{L p (\varepsilon')^2}{6 + 2\varepsilon'}\Big).$$

Applying this to the situation at hand, inserting $L = K - 1$ and the worst case value for $n = N - M_l = N(p + \varepsilon - \varepsilon p)$ and setting $\varepsilon' = (N/L)\varepsilon$ we get:

Lemma C.7: Define

$$\beta := N p \sqrt{\Big(\frac{K - 1}{N} + \varepsilon\Big)\Big(1 + \frac{\varepsilon}{p} - \varepsilon\Big)}. \tag{44}$$

For any $M \in [M_l, M_u]$ we have

$$\mathbb{P}\big(\|X_k (s^k)^*\|_2 > \beta \mid |\bar{\Lambda}_k| = M\big) \le 2 \cdot \exp\Big(-\frac{N p\, \varepsilon^2}{6\,\frac{K - 1}{N} + 2\varepsilon}\Big). \tag{45}$$



Appendix D 
Concentration Inequalities 

Here we will sketch the proofs of the concentration inequalities used in the previous section. They are based on 
a special version of Bernstein's inequality, see e.g. Q. 

Theorem D.1: Let $Y_i$, $i = 1 \ldots M$, be independent random variables with

$$\mathbb{E}(Y_i^2) \le v^2 \quad \text{and} \quad \mathbb{E}(|Y_i|^k) \le \frac{1}{2} k!\, v^2 c^{k-2}, \quad k > 2. \tag{46}$$

Then

$$\mathbb{P}\Big(\Big|\sum_{i=1}^{M} (Y_i - \mathbb{E}(Y_i))\Big| > \varepsilon\Big) \le 2 \exp\Big(-\frac{\varepsilon^2}{2(M v^2 + c\,\varepsilon)}\Big).$$

We will also use Hoeffding's inequality.

Theorem D.2 (Hoeffding's inequality): Let $Y_1 \ldots Y_N$ be independent random variables. Assume that the $Y_n$ are almost surely bounded, meaning for $1 \le n \le N$ we have $\mathbb{P}(Y_n \in [a_n, b_n]) = 1$. Then, for the sum of these variables $S = Y_1 + \ldots + Y_N$ we have the inequality

$$\mathbb{P}(S - \mathbb{E}(S) \ge N t) \le \exp\Big(-\frac{2 N^2 t^2}{\sum_{n=1}^{N} (b_n - a_n)^2}\Big),$$

which is valid for positive values of $t$. $\mathbb{E}(S)$ is the expected value of $S$.



A. Proof of Lemma C.2

In each row of $X_0$, the number of zero coefficients $|\bar{\Lambda}_k|$ is $N$ minus the number of non-zero coefficients $|\Lambda_k|$, which is the sum of the indicator variables $\sum_n \xi_{kn}$. The $\xi_{kn}$ take only the values zero and one, so we can use Hoeffding's inequality with $a_n = 0$, $b_n = 1$ and $\mathbb{E}(\sum_n \xi_{kn}) = pN$, leading to

$$\mathbb{P}(|\Lambda_k| - pN \ge N t) \le \exp(-2 N t^2).$$

Choosing $t = (1 - p)\varepsilon$ and using $|\bar{\Lambda}_k| = N - |\Lambda_k|$ we get

$$\mathbb{P}\big(|\bar{\Lambda}_k| \le N(1 - p)(1 - \varepsilon)\big) \le \exp(-2N(1 - p)^2 \varepsilon^2).$$

To bound the converse probability that $|\bar{\Lambda}_k|$ is very large, we set $Y_n = 1 - \xi_{kn}$ and again $t = (1 - p)\varepsilon$ to get directly

$$\mathbb{P}\big(|\bar{\Lambda}_k| \ge N(1 - p)(1 + \varepsilon)\big) \le \exp(-2N(1 - p)^2 \varepsilon^2).$$

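B. Proof of Theorem C.1

Since $\|x\|_1 = \sum_{i=1}^{N} \xi_i |g_i|$ we will use the Bernstein inequality with $Y_i = \xi_i |g_i|$. The moments of $\xi_i$ are constant, equal to $p$. The random variable $|g_i|$ follows a Chi-distribution of degree 1, so its moments are

$$\mathbb{E}(|g_i|^k) = 2^{k/2}\, \frac{\Gamma\big(\frac{k+1}{2}\big)}{\Gamma\big(\frac{1}{2}\big)}. \tag{47}$$

Especially, we have $\mathbb{E}(Y_i) = p\sqrt{2/\pi}$ and $\mathbb{E}(|Y_i|^2) = p$, and using the recurrence relation for the Gamma function $\Gamma(t + 1) = t\,\Gamma(t)$ and $\mathbb{E}(|g_i|) = \sqrt{2/\pi} < 1$ we can bound by induction the moments of $Y_i$ for $k > 2$ as

$$\mathbb{E}(|Y_i|^k) \le \frac{1}{2} k!\, p\, \big(1/\sqrt{2}\big)^{k-2}, \quad k > 2, \tag{48}$$

so the moments satisfy Condition (46) with $v^2 = p$ and $c = 1/\sqrt{2}$, and we get

$$\mathbb{P}\big(\|x\|_1 > N p \sqrt{2/\pi} + \varepsilon\big) \le 2 \cdot \exp\Big(-\frac{\varepsilon^2}{2 N p + \sqrt{2}\,\varepsilon}\Big).$$

Setting $\varepsilon = N p \cdot \varepsilon'$ yields the result.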
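The final substitution can be verified numerically. The sketch below assumes the Bernstein bound in the form $2\exp(-\varepsilon^2 / (2(Mv^2 + c\,\varepsilon)))$ with $M = N$, $v^2 = p$ and $c = 1/\sqrt{2}$, and compares it with the exponent $N p (\varepsilon')^2 / (2 + \sqrt{2}\,\varepsilon')$ after setting $\varepsilon = N p\, \varepsilon'$. Function names are illustrative.

```python
import math

def bernstein_bound(M, v2, c, eps):
    """2 exp(-eps^2 / (2 (M v^2 + c eps))), the assumed form of the Bernstein bound."""
    return 2.0 * math.exp(-eps ** 2 / (2.0 * (M * v2 + c * eps)))

def l1_norm_tail_bound(N, p, eps_rel):
    """2 exp(-N p eps'^2 / (2 + sqrt(2) eps')), the rescaled form of the bound."""
    return 2.0 * math.exp(-N * p * eps_rel ** 2 / (2.0 + math.sqrt(2.0) * eps_rel))

N, p, eps_rel = 200, 0.3, 0.25
direct = bernstein_bound(M=N, v2=p, c=1.0 / math.sqrt(2.0), eps=N * p * eps_rel)
rescaled = l1_norm_tail_bound(N, p, eps_rel)
```

The two values coincide because $\varepsilon^2 / (2Np + \sqrt{2}\,\varepsilon) = N p (\varepsilon')^2 / (2 + \sqrt{2}\,\varepsilon')$ when $\varepsilon = N p\, \varepsilon'$.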



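C. Proof of Lemma C.4 - first part

To bound $\|A^*\|_{2 \to 1}$ we begin by using the crude bound $\|A^*\|_{2 \to 1} = \|A\|_{\infty \to 2} \le \sum_{i=1}^{M} \|A_i\|_2$. We set $Y_i = \|A_i\|_2 = \big(\sum_{j=1}^{L} \xi_{ij}^2 g_{ij}^2\big)^{1/2}$. All $Y_i$ are identically distributed, so for the analysis we can drop the subscript $i$. We can calculate directly

$$\mathbb{E}(Y^2) = \mathbb{E}\Big(\sum_{j=1}^{L} \xi_j^2 g_j^2\Big) = pL.$$

For the higher order moments $k > 2$ we use a little trick to separate the expectation over $\xi$ and $g$,

$$\mathbb{E}\,Y^k = \mathbb{E}_g \mathbb{E}_\xi \Big(\sum_{j=1}^{L} \xi_j^2 g_j^2\Big)^{k/2} = \mathbb{E}_g\bigg(\Big(\sum_{j=1}^{L} g_j^2\Big)^{k/2}\, \mathbb{E}_\xi\Big(\frac{\sum_j \xi_j^2 g_j^2}{\sum_j g_j^2}\Big)^{k/2}\bigg).$$

The fraction in the last expression is always smaller than 1, so for $k > 2$ we have

$$\mathbb{E}\,Y^k \le \mathbb{E}_g\bigg(\Big(\sum_{j=1}^{L} g_j^2\Big)^{k/2}\, \mathbb{E}_\xi\Big(\frac{\sum_j \xi_j^2 g_j^2}{\sum_j g_j^2}\Big)\bigg) = p \cdot \mathbb{E}_g\Big(\Big(\sum_{j=1}^{L} g_j^2\Big)^{k/2}\Big).$$

The random variable $\bar{Y} = \big(\sum_{j=1}^{L} g_j^2\big)^{1/2}$ follows a Chi-distribution of degree $L$, so for its $k$-th moments we have the formula

$$\mathbb{E}(\bar{Y}^k) = 2^{k/2}\, \frac{\Gamma\big(\frac{k+L}{2}\big)}{\Gamma\big(\frac{L}{2}\big)}.$$

A long and tedious calculation involving the recurrence formula for the Gamma function, Stirling's formula and treating both cases, $k$ even respectively odd, yields the bound $\mathbb{E}(\bar{Y}^k) \le (L/2)^{k/2}\, k!$. This leads to $\mathbb{E}(Y^k) \le p\,(L/2)^{k/2}\, k!$, meaning that the higher order moments follow the decay condition in (46) for $v^2 = pL$ and $c = \sqrt{L/2}$. Together with the following bound for the first order moment,

$$\mathbb{E}(Y) \le \mathbb{E}(Y^2)^{1/2} = \sqrt{pL},$$

we get

$$\mathbb{P}\big(\|A^*\|_{2 \to 1} > M\sqrt{pL} + \varepsilon\big) \le 2 \exp\Big(-\frac{\varepsilon^2}{2\big(M p L + \varepsilon\sqrt{L/2}\big)}\Big).$$

To get the version of the formula used in Section V simply set $\varepsilon = M \sqrt{pL} \cdot \varepsilon'$ and observe that since $p \le 1$

$$\frac{\varepsilon^2}{2\big(M p L + \varepsilon \sqrt{L/2}\big)} = \frac{M \sqrt{p}\,(\varepsilon')^2}{2\sqrt{p} + \sqrt{2}\,\varepsilon'} \ge \frac{M p\,(\varepsilon')^2}{2 + \sqrt{2}\,\varepsilon'}.$$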

D. Proof of Lemma \C.4\ - second part 
To lower bound we expand it as 

M M n M 

i— 1 i— 1 j — 1 i=l 

The random variables $Y_i$ all follow the same distribution so it suffices to calculate the moments of $Y = |\sum_{j=1}^n \xi_j g_j z_j|$. Define $\tilde Y = \sum_{j=1}^n \xi_j g_j z_j$. Since the $g_j$ are i.i.d. zero mean Gaussians with variance $\sigma^2 = 1$, conditionally on $\xi$ the variable $\tilde Y$ is zero mean Gaussian with variance $\sigma^2 = \sum_{j=1}^n z_j^2 \xi_j^2 = \|z_\xi\|_2^2$ and we get

$$E(|Y|^k) = E(|\tilde Y|^k) = E\big(\|z_\xi\|_2^k \cdot |g_1|^k\big) = E_\xi\big(\|z_\xi\|_2^k\big) \cdot E_g\big(|g_1|^k\big). \tag{49}$$
Since $\|z_\xi\|_2 \leq \|z\|_2 = 1$, we have for $k \geq 2$

$$E_\xi\big(\|z_\xi\|_2^k\big) \leq E_\xi\big(\|z_\xi\|_2^2\big) = E_\xi\Big(\sum_{j=1}^n z_j^2 \xi_j^2\Big) \leq p,$$

while for $k = 1$ we get

$$E_\xi\big(\|z_\xi\|_2\big) = E_\xi\Big(\Big(\sum_{j=1}^n z_j^2 \xi_j^2\Big)^{1/2}\Big) \geq E_\xi\Big(\sum_{j=1}^n z_j^2 \xi_j^2\Big) = p.$$
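The two estimates on $E_\xi(\|z_\xi\|_2^k)$ for a unit vector $z$ (at most $p$ for $k \geq 2$, at least $p$ for $k = 1$) can be confirmed on a small example by exact enumeration over all Bernoulli masks. This is only a sketch: the helper name and the test vector are hypothetical, and $z_\xi$ is read here as $z$ with the entries outside the support of $\xi$ set to zero.

```python
import itertools
import math


def masked_norm_moment(z, p, k):
    """E_xi ||z_xi||_2^k for i.i.d. Bernoulli(p) masks xi, by exact enumeration."""
    total = 0.0
    for mask in itertools.product((0, 1), repeat=len(z)):
        # Probability of this particular 0/1 pattern.
        prob = math.prod(p if m else 1.0 - p for m in mask)
        # Euclidean norm of z restricted to the selected entries.
        norm = math.sqrt(sum(zj * zj for zj, m in zip(z, mask) if m))
        total += prob * norm ** k
    return total


if __name__ == "__main__":
    z = [0.75, 0.25, 0.5, 0.25, 0.25]  # unit Euclidean norm
    p = 0.3
    for k in (1, 2, 3, 4):
        print(k, masked_norm_moment(z, p, k))
```

Since $0 \leq \|z_\xi\|_2 \leq 1$, raising the norm to a power above 2 can only decrease it and taking the square root of its square can only increase it, which is exactly what the enumeration reflects.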

Again, $|g_1|$ is Chi-distributed of degree 1 so its moments are given by (47) and the moments of $Y_i$ are thus bounded by (48), which satisfies the decay condition in (46) for $c = 1/\sqrt{2}$. As a result

$$P\Big(\sum_{i=1}^M Y_i \leq M\,E(|Y|) - \varepsilon\Big) \leq 2\exp\Big(-\frac{\varepsilon^2}{2(Mp + \varepsilon/\sqrt{2})}\Big).$$

Together with the bound $E(|Y|) \geq p\sqrt{2/\pi}$, setting $\varepsilon = Mp \cdot \varepsilon'$ leads to the final form of the bound used in Section V.

E. Proof of Theorem C.6

We expand $\|Bs\|_2^2 = \sum_{i=1}^L |\langle B^i, s\rangle|^2$, where $B^i$ denotes the $i$-th row of $B$, and set $Y_i = |\langle B^i, s\rangle|^2 = \big(\sum_{j=1}^n \xi_{ij} g_{ij} s_j\big)^2$. Since the $Y_i$ are again identically distributed we drop the subscript $i$ for the analysis. First we get,



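$$E(Y) = E\Big(\Big(\sum_{j=1}^n \xi_j g_j s_j\Big)^2\Big) = \sum_{j=1}^n E\big(\xi_j^2 g_j^2 s_j^2\big) \leq np.$$

Observe that $\sum_j \xi_j g_j s_j$ is again Gaussian and distributed like $\|\xi\|_2 \cdot g_1$. Hence,

$$E(Y^k) = E\Big(\Big(\sum_{j=1}^n \xi_j g_j s_j\Big)^{2k}\Big) = E_\xi E_g\big(\|\xi\|_2^{2k}\, g_1^{2k}\big) = E_\xi\big(\|\xi\|_2^{2k}\big)\, E_g\big(g_1^{2k}\big).$$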



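For the even Gaussian moments we have the formula $E_g(g^{2k}) = \frac{(2k)!}{2^k k!}$, while the term depending on $\xi$ can be bounded as

$$E_\xi\big(\|\xi\|_2^{2k}\big) = E_\xi\Big(\Big(\sum_{j=1}^n \xi_j^2\Big)^k\Big) = n^k\, E_\xi\Big(\Big(\frac{1}{n}\sum_{j=1}^n \xi_j^2\Big)^k\Big) \leq n^k\, E_\xi\Big(\frac{1}{n}\sum_{j=1}^n \xi_j^2\Big) = n^k \cdot p,$$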



leading to $E(Y^k) \leq p\,n^k\,\frac{(2k)!}{2^k k!}$. In particular, for $k = 2$ we have $E(Y^2) \leq 3pn^2$ and so for $k \geq 2$ we can estimate

$$E(Y^k) \leq 3pn^2 \cdot n^{k-2}\,\frac{(2k)!}{3 \cdot 2^k k!} \leq \ldots \leq \frac{1}{2}\,(3pn^2)\,(2n)^{k-2}\, k!,$$



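meaning that the moments satisfy the decay condition in (46) with $c = 2n$ and therefore

$$P\big(\|Bs\|_2^2 \geq Lnp + \varepsilon\big) \leq 2\exp\Big(-\frac{\varepsilon^2}{6pn^2L + 2n\varepsilon}\Big).$$

Again setting $\varepsilon = Lnp \cdot \varepsilon'$ leads to the final version.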
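The moment estimates in this proof rest on two elementary facts: the even Gaussian moments equal $(2k)!/(2^k k!)$, and $p\,n^k (2k)!/(2^k k!)$ is dominated by $\tfrac12 (3pn^2)(2n)^{k-2} k!$ for all $k \geq 2$. The sketch below checks both with exact rational arithmetic. The function names are illustrative, and the majorant is written with $3pn^2$ in the role of the second moment bound, which is an assumption about how the decay condition is applied.

```python
import math
from fractions import Fraction


def even_gaussian_moment(k):
    """E(g^(2k)) = (2k)! / (2^k k!) for a standard normal g."""
    return Fraction(math.factorial(2 * k), 2 ** k * math.factorial(k))


def moment_bound(k, n, p):
    """Bound p * n^k * (2k)!/(2^k k!) on E(Y^k)."""
    return p * n ** k * even_gaussian_moment(k)


def decay_majorant(k, n, p):
    """Majorant (1/2) * (3 p n^2) * (2n)^(k-2) * k! with v = 3 p n^2, c = 2n."""
    return Fraction(1, 2) * 3 * p * n ** 2 * (2 * n) ** (k - 2) * math.factorial(k)


def decay_condition_holds(n, p, k_max=40):
    """True if the moment bound stays below the majorant for k = 2..k_max."""
    return all(
        moment_bound(k, n, p) <= decay_majorant(k, n, p)
        for k in range(2, k_max + 1)
    )


if __name__ == "__main__":
    print(even_gaussian_moment(2), even_gaussian_moment(3))
    print(decay_condition_holds(7, Fraction(1, 5)))
```

Using `Fraction` keeps the comparison exact, which matters because the case $k = 2$ is an equality.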



Appendix E 
Proof of Main Theorem 

First, we observe that if p < 4/5 and K/N < 1/3 all the appearing exponentials can be upper bounded by 

c^(-Np(l-p) £2{1 2 2£) ). 
Therefore, with the definition of $\alpha$, $\beta$, $\gamma$ in (42), (44) and (36) we obtain from Lemmata C.5, C.7 and C.1 that we have



a 2 (X ) - (3 2 (X ) > a -/3 
7(^0) ~ 7 



except with probability at most 



2K 



K 

e v p 



K 



■ exp 



e 2 (l-2e) 



AK exp (f log (f£) - JVp(l - p) 



(50)

Next, observe that for the right hand side to be smaller than 1, we need that $\varepsilon < 1/2$ and $Np(1-p)\varepsilon^2 > K$. Consequently

$$K/N < p(1-p)\varepsilon^2 < 1/16,$$

meaning that whenever $K/N > 1/3$ the probability bound is trivially true, and we only need to assume $p < 4/5$.
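The constant $1/16$ comes from $p(1-p) \leq 1/4$ combined with $\varepsilon^2 < 1/4$. A minimal grid check, with hypothetical helper names, illustrates that the cap $p(1-p)\varepsilon^2$ on $K/N$ never reaches $1/16$ when $\varepsilon < 1/2$.

```python
def sample_ratio_cap(p, eps):
    """Upper limit p(1-p)eps^2 on K/N implied by N p (1-p) eps^2 > K."""
    return p * (1.0 - p) * eps ** 2


def cap_below_one_sixteenth(steps=200):
    """Scan p in (0, 1) and eps in (0, 1/2) on a regular grid."""
    for i in range(1, steps):
        p = i / steps
        for j in range(1, steps):
            eps = 0.5 * j / steps  # strictly below 1/2
            if not sample_ratio_cap(p, eps) < 1.0 / 16.0:
                return False
    return True


if __name__ == "__main__":
    print(cap_below_one_sixteenth())
```

In particular any admissible pair $(p, \varepsilon)$ forces $K/N$ well below $1/3$, which is why the assumption $K/N \leq 1/3$ can be dropped.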



Now, from Theorem 4.1 we know that any sufficiently incoherent basis satisfying $\max_k \|m_k\|_2 < \frac{\alpha - \beta}{\gamma}$ will be locally identifiable by $\ell_1$ minimization, except with probability at most equal to the right hand side in (50).

Inserting the values for $\alpha$, $\beta$, $\gamma$ from (42), (44) and (36) we can lower bound the maximally allowed coherence $(\alpha - \beta)/\gamma$ with



(1 - p)(l - e)Ui -2e- e 2 ) - J(§ + e )(l + f - e) 



>(l-p).(l- 5e )-^| (f + e)(l + f). 
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